This article serves as a good exercise in seeing how forward propagation works and how the gradients are then computed to implement the backpropagation algorithm. The reader will also get comfortable with computing vector and tensor derivatives and with vector/matrix calculus. A useful document can be found here for the interested reader to get familiar with tensor operations.

Sometimes you see a diagram and it gives you an "aha" moment. I saw this one on Frederick Kratzert's blog. Using the input variables x and y, the forward pass (left half of the figure) calculates the output z as a function of x and y, i.e. z = f(x, y). The right half of the figure shows the backward pass. On receiving dL/dz (the derivative of the total loss with respect to the output z), we can calculate the gradients of the loss with respect to x and y by applying the chain rule, as shown in the figure: dL/dx = dL/dz · dz/dx and dL/dy = dL/dz · dz/dy. This post is a part of my forthcoming book on Mathematical Foundations of Data Science. The goal of the neural network is to minimise the loss function for the whole network of neurons. Hence, the problem of solving the equations represented by the neural network also becomes a problem of minimising the loss function for the entire network.
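To make the forward and backward pass of a single node concrete, here is a minimal sketch. The text only says z = f(x, y); the choice of a product node z = x * y, the function names `forward` and `backward`, and the numeric values are assumptions made purely for illustration.

```python
def forward(x, y):
    """Forward pass of a hypothetical product node: z = f(x, y) = x * y."""
    z = x * y
    cache = (x, y)  # keep the inputs; the backward pass needs them
    return z, cache


def backward(dL_dz, cache):
    """Backward pass: turn the upstream gradient dL/dz into dL/dx and dL/dy."""
    x, y = cache
    dL_dx = dL_dz * y  # chain rule: dL/dx = dL/dz * dz/dx, and dz/dx = y
    dL_dy = dL_dz * x  # chain rule: dL/dy = dL/dz * dz/dy, and dz/dy = x
    return dL_dx, dL_dy


# Illustrative values (assumed, not taken from the figure).
z, cache = forward(3.0, -4.0)
dL_dx, dL_dy = backward(2.0, cache)
print(z, dL_dx, dL_dy)
```

With x = 3, y = -4 and an upstream gradient dL/dz = 2, the node outputs z = -12 and passes back dL/dx = -8 and dL/dy = 6. A different f would only change the two local derivatives inside `backward`; the chain-rule structure stays the same.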

Artificial Neural Networks (ANNs) are the basis for much of what is happening in technology today. They might be smaller or bigger depending on the application, but they are almost always there. Much of modern Artificial Intelligence, Machine Learning, and Deep Learning is powered by ANNs. Although many of us have heard the term a lot, we often have little to no knowledge of what they actually are. But you surely want to learn more about them in an easily understandable way. So today, let's talk Neural Networks!

Neural networks have attracted great attention for a long time, and many researchers are devoted to improving the effectiveness of neural network training algorithms. Though stochastic gradient descent (SGD) and other explicit gradient-based methods are widely adopted, there are still many challenges, such as vanishing gradients and small step sizes, which lead to slow convergence and instability of SGD algorithms. Motivated by error back propagation (BP) and proximal methods, we propose a semi-implicit back propagation method for neural network training. Similar to BP, the differences on the neurons are propagated in a backward fashion, and the parameters are updated with a proximal mapping. The implicit update for both hidden neurons and parameters allows a large step size to be chosen in the training algorithm. Finally, we also show that any fixed point of convergent sequences produced by this algorithm is a stationary point of the objective loss function. The experiments on both MNIST and CIFAR-10 demonstrate that the proposed semi-implicit BP algorithm leads to better performance in terms of both loss decrease and training/validation accuracy, compared to SGD and a similar algorithm, ProxBP.
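The remark that an implicit update tolerates a large step size can be illustrated on a toy problem. The sketch below is not the semi-implicit BP algorithm described above. It only compares one explicit gradient step with one proximal (implicit) step on an assumed scalar loss f(w) = 0.5 * a * w^2, and the names `explicit_step`, `proximal_step` and `run` are hypothetical.

```python
def explicit_step(w, a, eta):
    """One explicit gradient step on the assumed loss f(w) = 0.5 * a * w**2."""
    return w - eta * a * w


def proximal_step(w, a, eta):
    """One proximal step: argmin over u of f(u) + (u - w)**2 / (2 * eta).

    Setting the derivative a * u + (u - w) / eta to zero gives the closed form below.
    """
    return w / (1.0 + eta * a)


def run(step, w0=1.0, a=1.0, eta=3.0, n_steps=10):
    """Apply the same update rule n_steps times, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = step(w, a, eta)
    return w


# A deliberately large step size (eta * a = 3 > 2) makes the explicit rule unstable.
w_explicit = run(explicit_step)
w_proximal = run(proximal_step)
print(w_explicit, w_proximal)
```

On this toy loss the explicit iterate is multiplied by (1 - eta * a) at every step, so it grows in magnitude whenever eta * a > 2, while the proximal iterate is multiplied by 1 / (1 + eta * a), which is below 1 for any positive step size.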

I prefer Option 2 and take that approach to learning any new topic. I might not be able to tell you the entire math behind an algorithm, but I can tell you the intuition. I can tell you the best scenarios in which to apply an algorithm, based on my experiments and understanding. In my interactions with people, I find that many don't take the time to develop this intuition, and hence they struggle to apply things in the right manner. In this article, I will discuss the building blocks of a neural network from scratch and focus on developing the intuition needed to apply neural networks.