Reviews: Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

Neural Information Processing Systems 

This paper studies learning over-parameterized one-hidden-layer ReLU neural networks for multi-class classification via SGD, and the corresponding generalization error. The authors consider a mixture data distribution in which each class has well-separated, compact support. They show that, under suitable assumptions, SGD applied to this learning model achieves small prediction error with high probability. As a result, even severely over-parameterized models trained with SGD generalize well, despite the network having enough capacity to fit arbitrary labels. The main insight of the theoretical analysis appears to be the observation that, in the over-parameterized regime with random initialization, most ReLU neurons do not change their activation patterns during training.
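
To make this insight concrete, the following is a minimal NumPy sketch (not the authors' experimental setup; the data model, width, learning rate, and loss are illustrative assumptions) that trains a wide one-hidden-layer ReLU network with SGD on a toy two-cluster mixture with compact supports, then measures how many (neuron, sample) activation entries flip relative to initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 20, 2000, 200          # input dim, hidden width (over-parameterized), samples

# Toy stand-in for the paper's mixture assumption: two well-separated
# classes supported on small balls around unit-norm means.
mu = rng.normal(size=(2, d))
mu /= np.linalg.norm(mu, axis=1, keepdims=True)
labels = rng.integers(0, 2, size=n)
noise = rng.normal(size=(n, d))
noise = 0.1 * noise / np.linalg.norm(noise, axis=1, keepdims=True)  # radius-0.1 support
X = mu[labels] + noise
y = 2.0 * labels - 1.0           # +/-1 targets

# One-hidden-layer ReLU net: f(x) = a^T relu(W x), with the output
# weights a_r fixed at +/-1 and only W trained (a common setup in
# this line of analysis).
W0 = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
W = W0.copy()

def activation_pattern(W):
    return X @ W.T > 0           # n x m boolean pattern over the training set

pattern0 = activation_pattern(W0)

lr = 0.5
for epoch in range(20):
    for i in rng.permutation(n):
        x = X[i]
        z = W @ x
        f = a @ np.maximum(z, 0.0)
        # SGD step on squared loss; the gradient touches only active units.
        grad = (f - y[i]) * (a * (z > 0))[:, None] * x[None, :]
        W -= lr * grad

flip_frac = np.mean(activation_pattern(W) != pattern0)
acc = np.mean(np.sign(np.maximum(X @ W.T, 0.0) @ a) == y)
print(f"train accuracy: {acc:.3f}, activation entries flipped: {flip_frac:.4%}")
```

Under these assumptions one would expect the network to fit the training data while only a small fraction of activation entries flip, which is the qualitative phenomenon the paper's analysis exploits: near random initialization, the network behaves almost linearly in its weights.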