Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
Brutzkus, Alon, Globerson, Amir
Deep neural networks have achieved state-of-the-art performance on many machine learning tasks in areas such as natural language processing (Wu et al., 2016), computer vision (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012). Training of such networks is often successfully performed by minimizing a high-dimensional non-convex objective function, using simple first-order methods such as stochastic gradient descent. Nonetheless, the success of deep learning from an optimization perspective is poorly understood theoretically. Current results are mostly pessimistic, suggesting that even training a 3-node neural network is NPhard (Blum & Rivest, 1993), and that the objective function of a single neuron can admit exponentially many local minima (Auer et al., 1996; Safran & Shamir, 2016). There have been recent attempts to bridge this gap between theory and practice.
Feb-25-2017