When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?
Chatterji, Niladri S., Long, Philip M., Bartlett, Peter L.
arXiv.org Artificial Intelligence
Interest in the properties of interpolating deep learning models trained with first-order optimization methods is surging [Zha 17a; Bel 19]. One important question is to understand how gradient descent with appropriate random initialization routinely finds interpolating (near-zero training loss) solutions to these non-convex optimization problems. In this paper our focus is to understand when gradient descent drives the logistic loss to zero when applied to fixed-width deep networks using smooth approximations to the ReLU activation function. We derive upper bounds on the rate of convergence under two conditions. The first result only requires that the initial loss is small, but does not require any assumption about the width of the network. It guarantees that if the initial loss is small then gradient descent drives the logistic loss down to zero.
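The setting described in the abstract can be illustrated with a minimal sketch: plain gradient descent on the logistic loss for a small fixed-width network with a smooth approximation to the ReLU. Several choices here are assumptions and are not taken from the abstract: the Huberized form of the smoothed ReLU and its parameter `h`, the use of a single hidden layer instead of a deep network, the toy data set, the step size, and all function names. It is meant only to show the objects involved, not the paper's analysis or its conditions.

```python
import math
import random


def smoothed_relu(z, h=0.5):
    """Assumed smooth ReLU surrogate: 0 for z <= 0, z^2/(2h) on (0, h], z - h/2 above."""
    if z <= 0.0:
        return 0.0
    if z <= h:
        return z * z / (2.0 * h)
    return z - h / 2.0


def smoothed_relu_grad(z, h=0.5):
    """Derivative of smoothed_relu; continuous, unlike the ReLU derivative."""
    if z <= 0.0:
        return 0.0
    if z <= h:
        return z / h
    return 1.0


def logistic_loss(margin):
    """Numerically stable log(1 + exp(-margin))."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sigmoid_neg(margin):
    """sigmoid(-margin), i.e. minus the derivative of logistic_loss."""
    if margin >= 0:
        e = math.exp(-margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))


def forward(W, v, x):
    """One hidden layer (a simplification): f(x) = sum_j v_j * phi(w_j . x)."""
    pre = [sum(wk * xk for wk, xk in zip(w, x)) for w in W]
    out = sum(vj * smoothed_relu(p) for vj, p in zip(v, pre))
    return pre, out


def mean_loss(W, v, data):
    return sum(logistic_loss(y * forward(W, v, x)[1]) for x, y in data) / len(data)


def train(data, width=4, steps=500, lr=0.1, seed=0):
    """Full-batch gradient descent from a random Gaussian initialization."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    W = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(width)]
    v = [rng.gauss(0.0, 0.5) for _ in range(width)]
    losses = [mean_loss(W, v, data)]
    for _ in range(steps):
        grad_W = [[0.0] * dim for _ in range(width)]
        grad_v = [0.0] * width
        for x, y in data:
            pre, out = forward(W, v, x)
            # d(loss)/d(output) for this example, averaged over the data set
            g = -y * sigmoid_neg(y * out) / len(data)
            for j in range(width):
                grad_v[j] += g * smoothed_relu(pre[j])
                coeff = g * v[j] * smoothed_relu_grad(pre[j])
                for k in range(dim):
                    grad_W[j][k] += coeff * x[k]
        for j in range(width):
            v[j] -= lr * grad_v[j]
            for k in range(dim):
                W[j][k] -= lr * grad_W[j][k]
        losses.append(mean_loss(W, v, data))
    return W, v, losses


# Hypothetical, linearly separable toy data: (input, label in {-1, +1}).
DATA = [
    ((1.0, 0.5), 1.0),
    ((0.8, 1.0), 1.0),
    ((-1.0, -0.5), -1.0),
    ((-0.7, -1.0), -1.0),
]

W, v, losses = train(DATA)
print("initial loss:", losses[0])
print("final loss:  ", losses[-1])
```

Printing `losses` at intervals gives a rough picture of how quickly the training loss falls on this toy problem; it says nothing about the rates derived in the paper.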
Feb-9-2021
- Country:
  - North America > United States
    - Montana (0.04)
    - California > Alameda County
      - Berkeley (0.04)
  - Europe > United Kingdom
    - England > Cambridgeshire > Cambridge (0.04)
- Genre:
  - Research Report (0.49)
- Technology: