SGD Learns the Conjugate Kernel Class of the Network

Amit Daniely

Neural Information Processing Systems 

While stochastic gradient descent (SGD) from a random initialization is probably the most popular supervised learning algorithm today, we have very few results that depict conditions guaranteeing its success. Indeed, to the best of our knowledge, Andoni et al. [2014] provides the only known result of this form, and it is valid in a rather restricted setting: namely, for depth-2 networks, where the underlying distribution is Gaussian, the algorithm is full gradient descent (rather than SGD), and the task is regression in which the learnt function is a constant-degree polynomial. We build on the framework of Daniely et al. [2016] to establish guarantees on SGD in a rather general setting. Daniely et al. [2016] defined a framework that associates a reproducing kernel to a network architecture.
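To make the associated kernel concrete, the following is a minimal sketch of how a conjugate kernel can be evaluated for a fully connected ReLU network on unit-norm inputs, assuming the dual-activation formula for the normalized ReLU from Daniely et al. [2016]; the function names and the depth-2 usage example are illustrative, not the paper's code.

```python
import numpy as np

def relu_dual(rho):
    """Dual activation of the normalized ReLU (Daniely et al. [2016]):
    maps the inner product of two unit-norm inputs to the inner product
    of their (infinite-width) hidden representations."""
    rho = np.clip(rho, -1.0, 1.0)  # guard against round-off
    return (np.sqrt(1.0 - rho**2) + (np.pi - np.arccos(rho)) * rho) / np.pi

def conjugate_kernel(x1, x2, depth):
    """Conjugate kernel of a depth-`depth` fully connected ReLU network on
    unit-norm inputs: start from the Euclidean inner product and apply the
    dual activation once per hidden layer."""
    rho = float(np.dot(x1, x2))
    for _ in range(depth):
        rho = relu_dual(rho)
    return rho

# Usage: two random unit-norm inputs, depth-2 architecture (hypothetical example).
rng = np.random.default_rng(0)
x1 = rng.standard_normal(10); x1 /= np.linalg.norm(x1)
x2 = rng.standard_normal(10); x2 /= np.linalg.norm(x2)
print(conjugate_kernel(x1, x2, depth=2))
```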
