When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

Chatterji, Niladri S., Long, Philip M., Bartlett, Peter L.

arXiv.org Artificial Intelligence 

Interest in the properties of interpolating deep learning models trained with first-order optimization methods is surging [Zha 17a; Bel 19]. One important question is to understand how gradient descent with appropriate random initialization routinely finds interpolating (near-zero training loss) solutions to these non-convex optimization problems. In this paper, our focus is to understand when gradient descent drives the logistic loss to zero when applied to fixed-width deep networks using smooth approximations to the ReLU activation function. We derive upper bounds on the rate of convergence under two conditions. The first result only requires that the initial loss is small, but does not require any assumption about the width of the network. It guarantees that if the initial loss is small then gradient descent drives the logistic loss down to zero.
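To make the setting concrete, the following is a minimal sketch of full-batch gradient descent on the logistic loss for a small fixed-width network with a smoothed ReLU. It is an illustration only, not the construction analyzed in the paper: the Huberized form of the activation, the single hidden layer, the width, the step size, the Gaussian initialization, and the toy data are all assumptions introduced here, as are the function names.

```python
import math
import random


def smoothed_relu(z, h=1.0):
    """Assumed smooth ReLU surrogate: 0 for z < 0, z^2/(2h) on [0, h], z - h/2 above h."""
    if z < 0.0:
        return 0.0
    if z <= h:
        return z * z / (2.0 * h)
    return z - h / 2.0


def smoothed_relu_grad(z, h=1.0):
    """Derivative of smoothed_relu: z / h clipped to [0, 1], so it is continuous."""
    return min(max(z / h, 0.0), 1.0)


def logistic_loss(margin):
    """log(1 + exp(-margin)), written to avoid overflow for large |margin|."""
    if margin > 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sigmoid_neg(margin):
    """1 / (1 + exp(margin)), written to avoid overflow for large |margin|."""
    if margin >= 0.0:
        e = math.exp(-margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))


def train(xs, ys, width=8, step=0.1, steps=500, seed=0):
    """Gradient descent on the mean logistic loss with labels in {-1, +1}.

    Hypothetical model: f(x) = sum_j v[j] * smoothed_relu(w[j] . x).
    Returns a list where losses[t] is the training loss after t updates.
    """
    rng = random.Random(seed)
    d, n = len(xs[0]), len(xs)
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(width)]
    v = [rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
    losses = []
    for _ in range(steps + 1):
        grad_w = [[0.0] * d for _ in range(width)]
        grad_v = [0.0] * width
        total = 0.0
        for x, y in zip(xs, ys):
            pre = [sum(wj[k] * x[k] for k in range(d)) for wj in w]
            out = sum(v[j] * smoothed_relu(pre[j]) for j in range(width))
            margin = y * out
            total += logistic_loss(margin)
            # Derivative of the per-example loss with respect to the output.
            g = -y * sigmoid_neg(margin) / n
            for j in range(width):
                grad_v[j] += g * smoothed_relu(pre[j])
                scale = g * v[j] * smoothed_relu_grad(pre[j])
                for k in range(d):
                    grad_w[j][k] += scale * x[k]
        losses.append(total / n)
        # Plain gradient descent step on all parameters.
        for j in range(width):
            v[j] -= step * grad_v[j]
            for k in range(d):
                w[j][k] -= step * grad_w[j][k]
    return losses


# Toy, linearly separable data (made up for illustration).
xs = [[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]]
ys = [1, 1, -1, -1]
losses = train(xs, ys)
print("initial loss:", losses[0], "final loss:", losses[-1])
```

Watching `losses` over the iterations gives an informal picture of the quantity the abstract refers to, the training logistic loss under gradient descent. The sketch says nothing about the conditions or rates established in the paper.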
