When does gradient descent with logistic loss find interpolating two-layer networks?

Chatterji, Niladri S., Long, Philip M., Bartlett, Peter L.

arXiv.org Machine Learning 

The success of deep learning models has led to considerable recent interest in understanding the properties of "interpolating" neural network models, which achieve (near-)zero training loss [Zha 17a; Bel 19]. One aspect of understanding these models is to characterize theoretically how first-order gradient methods (with appropriate random initialization) seem to reliably find interpolating solutions to non-convex optimization problems. In this paper, we show that, under two sets of conditions, training fixed-width two-layer networks with gradient descent drives the logistic loss to zero. The networks have smooth "Huberized" ReLUs [Tat 20, see (1) and Figure 1], and the output weights are not trained. The first result requires only that the initial loss be small; it makes no assumption about either the width of the network or the number of samples. It guarantees that if the initial loss is small, then gradient descent drives the logistic loss to zero. For our second result, we assume that the inputs come from four clusters, two per class, and that clusters with opposite labels are appropriately separated. Under these assumptions, we show that random Gaussian initialization followed by a single step of gradient descent is enough to reduce the loss sufficiently for the first result to apply. A few proof ideas that facilitate our results are as follows: under our first set of assumptions, when the loss is small, we show that the negative gradient aligns well with the parameter vector.
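The setting described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code, and several details are assumptions made here for illustration: the activation is taken to be 0 for z < 0, z^2/(2h) on [0, h], and z - h/2 above h (the abstract only points to equation (1), which is not reproduced here); the hidden units carry biases; the fixed output weights are random signs scaled by 1/sqrt(width); and the helper names `make_clusters` and `train`, the cluster centers, the noise level, the width, and the step size are all hypothetical choices.

```python
import numpy as np

def huberized_relu(z, h=1.0):
    # Assumed form: 0 for z < 0, z^2 / (2h) on [0, h], z - h/2 for z > h.
    return np.where(z < 0, 0.0, np.where(z <= h, z**2 / (2 * h), z - h / 2))

def huberized_relu_grad(z, h=1.0):
    # Derivative of the activation above: 0, then z/h, then 1.
    return np.clip(z / h, 0.0, 1.0)

def make_clusters(n_per_cluster=50, noise=0.1, seed=0):
    # Hypothetical data: four clusters, two per class, in an XOR-like layout.
    rng = np.random.default_rng(seed)
    centers = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
    labels = np.array([1.0, 1.0, -1.0, -1.0])
    X = np.concatenate(
        [c + noise * rng.normal(size=(n_per_cluster, 2)) for c in centers]
    )
    y = np.repeat(labels, n_per_cluster)
    return X, y

def train(X, y, width=64, lr=0.1, steps=300, h=1.0, seed=1):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(width, d))   # Gaussian initialization of hidden weights
    b = rng.normal(size=width)        # hidden biases (an assumption of this sketch)
    u = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)  # fixed, never updated
    losses = []
    for _ in range(steps + 1):
        pre = X @ W.T + b                         # pre-activations, shape (n, width)
        margins = y * (huberized_relu(pre, h) @ u)
        # Mean logistic loss log(1 + exp(-margin)), computed stably.
        losses.append(float(np.mean(np.logaddexp(0.0, -margins))))
        # d(loss)/d(output) = -y * sigmoid(-margin) / n
        g = -y * np.exp(-np.logaddexp(0.0, margins)) / n
        G = g[:, None] * huberized_relu_grad(pre, h) * u   # shape (n, width)
        W -= lr * (G.T @ X)                       # update hidden weights only
        b -= lr * G.sum(axis=0)
    return losses

if __name__ == "__main__":
    X, y = make_clusters()
    losses = train(X, y)
    print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

The sketch only shows the training procedure on clustered data; it does not verify the separation conditions or the small-initial-loss condition that the paper's results require.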
