Stronger Convergence Results for Deep Residual Networks: Network Width Scales Linearly with Training Data Size

Gulcu, Talha Cihad

arXiv.org Machine Learning 

Deep neural networks have achieved remarkable success across a large variety of applications, including computer vision [1], natural language processing [2], speech recognition [3], and the game of Go [4]. However, the reason why deep networks perform well across such varied tasks is still not fully understood. The optimization performance of deep networks is one subject that requires involved theoretical study, given that gradient descent can achieve zero training loss even for random labels [5] and that the loss of deep networks is highly non-convex. Several lines of work investigate the optimization of deep networks from different perspectives. For example, a large number of works consider the optimization landscape corresponding to different activation functions [6-11], whereas others [12-15] ensure global convergence by imposing restrictions on the input distribution. In recent years, many papers have provided convergence guarantees for over-parameterized two-layer and deep networks. It is shown in [16] that gradient descent can find near-global minima of a single-hidden-layer network in polynomial time with respect to the accuracy and sample size.
