Training Deep Networks without Learning Rates Through Coin Betting
Orabona, Francesco; Tommasi, Tatiana
In recent years deep learning has demonstrated great success in a large number of fields and has attracted the attention of various research communities, with the consequent development of multiple coding frameworks (e.g., Caffe [Jia et al., 2014], TensorFlow [Abadi et al., 2015]) and the spread of blogs, online tutorials, books, and dedicated courses. Besides reaching out to scientists with different backgrounds, the need for all these supporting tools also originates from the nature of deep learning: it is a methodology that involves many structural details as well as several hyperparameters, whose importance has been growing with the recent trend of designing deeper and multi-branch networks. Some of the hyperparameters define the model itself (e.g., number of hidden layers, regularization coefficients, kernel size for convolutional layers), while others are related to the model training procedure. In both cases, hyperparameter tuning is a critical step in realizing the full potential of deep learning, and most of the knowledge in this area comes from hands-on practice, years of experimentation, and, to some extent, mathematical justification [Bengio, 2012]. With respect to the optimization process, stochastic gradient descent (SGD) has proved to be a key component of the success of deep learning, but its effectiveness depends critically on the choice of the initial learning rate and the learning rate schedule. This has prompted a line of research on algorithms that reduce the hyperparameter dependence of SGD; see Section 2 for an overview of the related literature.
Nov-4-2017