When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

Chatterji, Niladri S., Long, Philip M., Bartlett, Peter L.

Feb-9-2021–arXiv.org Artificial Intelligence

Interest in the properties of interpolating deep learning models trained with first-order optimization methods is surging [Zha 17a; Bel 19]. One important question is to understand how gradient descent with appropriate random initialization routinely finds interpolating (near-zero training loss) solutions to these non-convex optimization problems. In this paper, our focus is to understand when gradient descent drives the logistic loss to zero when applied to fixed-width deep networks using smooth approximations to the ReLU activation function. We derive upper bounds on the rate of convergence under two conditions. The first result requires only that the initial loss is small, and makes no assumption about the width of the network. It guarantees that if the initial loss is small, then gradient descent drives the logistic loss down to zero.
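For readers who want the objects in the abstract written out, the following is a minimal sketch in standard notation. The symbols (J, θ, α, n, h, φ_h) and the 1/n normalization are assumptions made for illustration and are not taken from the abstract; the Huberized ReLU is shown only as one example of a smooth approximation to ReLU.

```latex
% Illustrative notation (assumed, not from the abstract):
% training data (x_i, y_i), i = 1, ..., n, with labels y_i in {-1, +1};
% f_\theta is the network output with parameters \theta; \alpha > 0 is the step size.
\[
  J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i f_\theta(x_i))\bigr),
  \qquad
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J\bigl(\theta^{(t)}\bigr).
\]
% One smooth approximation to ReLU: the Huberized ReLU with parameter h > 0.
\[
  \phi_h(z) =
  \begin{cases}
    0, & z < 0, \\
    z^2 / (2h), & 0 \le z \le h, \\
    z - h/2, & z > h.
  \end{cases}
\]
```

Under this normalization, every summand is strictly positive, so the loss can approach zero only if every margin y_i f_θ(x_i) grows without bound. In particular, once J(θ) < (log 2)/n, each summand is below log 2, which forces y_i f_θ(x_i) > 0 for every i: all training examples are classified correctly.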


Country:
- North America > United States
  - Montana (0.04)
  - California > Alameda County
    - Berkeley (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report (0.49)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Machine Learning
    - Neural Networks (1.00)
    - Statistical Learning > Gradient Descent (0.90)
