Both represent the same principle, but for our purposes it's easier to explain using the geometric definition. In geometry, slope represents the steepness of a line. It answers the question: how much does y change given a specific change in x? But what if, instead of the slope between two points, I asked you for the slope at a single point on the line? In that case there is no obvious "rise over run" to calculate.
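The shift from "slope between two points" to "slope at a point" can be made concrete with a tiny sketch. The function `f(x) = x**2` below is a hypothetical example, not from the original text; shrinking the "run" toward zero approximates the derivative at a single point:

```python
# Slope as "rise over run" vs. slope at a single point,
# using the example function f(x) = x**2.

def f(x):
    return x ** 2

# Rise over run between two points, x = 1 and x = 3:
rise_over_run = (f(3) - f(1)) / (3 - 1)  # (9 - 1) / 2 = 4.0

# Slope *at* the single point x = 2: shrink the run to a tiny h,
# which approximates the derivative f'(x) = 2*x, i.e. 4 at x = 2.
h = 1e-6
slope_at_2 = (f(2 + h) - f(2)) / h

print(rise_over_run)  # 4.0
print(slope_at_2)     # approximately 4.0
```

This finite-difference trick is exactly the limit definition of the derivative taken numerically.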

Much of statistics and machine learning involves turning a bunch of data into new numbers that support good decisions. For example, a data scientist might use your past bids on a Google search term, along with their results, to estimate the expected return on investment (ROI) for new bids. Armed with this knowledge, you can make an informed decision about how much to bid in the future. Cool, but what if those ROI estimates are wrong? Luckily, data scientists don't just guess these numbers: they use data to derive a reasoned estimate for them.

Understanding the concept of the gradient is useful for understanding the logic of the gradient descent algorithm; the Wikipedia article on stationary points is a helpful starting point. Gradient descent starts from a point on the cost function and, on each iteration, steps in the direction that reduces the cost, so that the derivative (slope) at the current point shrinks toward zero. The goal is to find the point whose slope is zero, in other words, the minimum. Substituting that point's coordinates into the hypothesis function gives us the hypothesis with the least error our model can achieve.
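The loop described above can be sketched in a few lines. The cost function `J(w) = (w - 3)**2` is a toy example chosen for illustration (its minimum, the stationary point with slope zero, sits at `w = 3`); the learning rate and iteration count are likewise illustrative:

```python
# Minimal gradient descent sketch on a toy cost function
# J(w) = (w - 3)**2, whose derivative is 2*(w - 3).
# The stationary point (slope = 0) is the minimum at w = 3.

def grad(w):
    return 2 * (w - 3)

w = 0.0        # arbitrary starting point
lr = 0.1       # learning rate (step size)

for _ in range(200):
    w -= lr * grad(w)  # step against the slope

print(w)  # converges to 3, where the derivative is zero
```

Each iteration moves `w` opposite the sign of the slope, so the updates naturally shrink as the slope approaches zero near the minimum.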

In general, the overall performance of a neural network depends on several factors. The one usually taking the spotlight is the network architecture; however, this is only one among many important components. An often overlooked contributor to a performant model is the optimizer used to fit it. Just to illustrate the complexity of optimizing: a ResNet18 architecture has 11,689,512 parameters, so finding an optimal parameter configuration means locating a point in an 11,689,512-dimensional space. If we were to brute-force this, we might divide the space into a grid, say by selecting 10 points along each dimension; that already yields 10^11,689,512 candidate configurations, which is hopelessly infeasible to evaluate.
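The grid-search arithmetic can be checked directly. The number is far too large to materialize, so the sketch below works in log form:

```python
import math

# How big is a naive grid search over ResNet18's parameter space?
n_params = 11_689_512    # ResNet18 parameter count
points_per_dim = 10      # candidate values per dimension

# Total grid points = points_per_dim ** n_params. Even writing out
# that integer is infeasible, so we report its base-10 exponent.
log10_configs = n_params * math.log10(points_per_dim)
print(f"about 10^{log10_configs:.0f} configurations")
```

For comparison, the observable universe is commonly estimated to contain on the order of 10^80 atoms, which is why gradient-based optimizers, rather than exhaustive search, are the only practical option.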

In this post, we looked at batch gradient descent, the need for new optimization techniques, and then briefly discussed how to interpret contour plots. After that, we looked at six different optimization techniques and three data strategies (batch, mini-batch and stochastic), building an intuitive understanding of where to use each of these algorithms. In practice, the Adam optimizer with mini-batches of size 32, 64 or 128 is the default choice, at least for image classification tasks using CNNs and for large sequence-to-sequence models.
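To make the default recommendation concrete, here is a minimal sketch of the Adam update rule on a single scalar parameter. The cost `J(w) = (w - 5)**2` is a toy example, and the learning rate of 0.1 is larger than the common default of 0.001 so the toy converges in few steps; `beta1`, `beta2` and `eps` use their standard defaults:

```python
import math

# Adam update rule sketched on a scalar, minimizing J(w) = (w - 5)**2.
def grad(w):
    return 2 * (w - 5)

w, lr = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0  # running first and second moments of the gradient

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # first moment (mean)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(w)  # near 5
```

In real training you would reach for a library implementation (e.g. `torch.optim.Adam`) and compute `g` from a mini-batch of 32 to 128 examples rather than the full dataset, exactly as the summary above recommends.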