Have you ever wondered which optimization algorithm to use for your Neural network Model to produce slightly better and faster results by updating the Model parameters such as Weights and Bias values . Should we use Gradient Descent or Stochastic gradient Descent or Adam? I too didn't know about the major differences between these different types of Optimization Strategies and which one is better over another before writing this article. Optimization algorithms helps us to minimize (or maximize) a Loss function (another name for Error function) E(x) which is simply a mathematical function dependent on the Model's internal parameters which are used in computing the target values(Y) from the set of predictors(X) used in the model. For example -- we call the Weights(W) and the Bias(b) values of the Neural Network as its internal parameters which are used in computing the Output values and play a major role in the training process of the Neural Network Model .
We describe an alternative learning method for neural networks, which we call Blind Descent. By design, Blind Descent does not face problems like exploding or vanishing gradients. In Blind Descent, gradients are not used to guide the learning process. In this paper, we present Blind Descent as a more fundamental learning process compared to gradient descent. We also show that gradient descent can be seen as a specific case of the Blind Descent algorithm. We also train two neural network architectures, a multilayer perceptron and a convolutional neural network, using the most general Blind Descent algorithm to demonstrate a proof of concept.
We propose a framework to perform streaming covariance selection. Our approach employs regularization constraints where a time-varying sparsity parameter is iteratively estimated via stochastic gradient descent. This allows for the regularization parameter to be efficiently learnt in an online manner. The proposed framework is developed for linear regression models and extended to graphical models via neighbourhood selection. Under mild assumptions, we are able to obtain convergence results in a non-stochastic setting. The capabilities of such an approach are demonstrated using both synthetic data as well as neuroimaging data.
In this post, we have looked at the batch gradient descent, the need to develop new optimization techniques, and then we briefly discussed how to interpret contour plots. After that, we have looked at behind six different optimization techniques and three different data strategies (batch, mini-batch & stochastic) with an intuitive understanding which helps to know where to use any of these algorithms. In Practice Adam optimizer with mini-batch of sizes 32, 64 and 128 is the default choice, at least for all the image classification tasks which deal with CNN and large sequence to sequence models.
Stochastic Gradient Descent (SGD) is a very powerful technique, currently employed to optimize all deep learning models. However, the vanilla algorithm has many limitations, in particular when the system is ill-conditioned and could never find the global minimum. In this post, we're going to analyze how it works and the most important variations that can speed up the convergence in deep models.