In computational networks, activation functions play a key role in defining the output of a node given an input or a set of inputs. In the field of Artificial Neural Networks (ANNs), the Sigmoid function is one such function: it is a type of activation function for artificial neurons. An activation function is also called a transfer function, but it should not be confused with a linear system's transfer function. There are several types of activation functions -- a list of which is available on Wikipedia.
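To make the idea of "the output of a node given a set of inputs" concrete, here is a minimal sketch of a single artificial neuron. The names `neuron_output` and `logistic` are illustrative choices, not taken from any particular library, and the weights and bias are arbitrary example values.

```python
import math
from typing import Callable, Sequence


def logistic(z: float) -> float:
    """Logistic sigmoid in its simplest form (fine for moderate z)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(
    inputs: Sequence[float],
    weights: Sequence[float],
    bias: float,
    activation: Callable[[float], float],
) -> float:
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Example values chosen so that the weighted sum is exactly 0.
y = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0, logistic)
```

Passing the activation in as an argument keeps the neuron itself unchanged when a different activation function is swapped in.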
With this post, we are starting the third chapter -- activation functions and their derivatives. How they are used in Deep Learning will be discussed later. The most important post in this chapter is the last one, where we will talk about the Softmax activation function and its Jacobian, which very few people talk about. These posts are very short. So, let us begin with the first activation function, i.e., the Sigmoid function.
A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. Special cases of the sigmoid function include the Gompertz curve (used in modeling systems that saturate at large values of x) and the ogee curve (used in the spillway of some dams). Sigmoid functions have a domain of all real numbers, with a return value that increases monotonically, most often from 0 to 1 or alternatively from -1 to 1, depending on convention. A wide variety of sigmoid functions, including the logistic and hyperbolic tangent functions, have been used as the activation function of artificial neurons. Sigmoid curves are also common in statistics as cumulative distribution functions (which go from 0 to 1), such as the integrals of the logistic distribution, the normal distribution, and Student's t probability density functions.
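The two conventions mentioned above (output in 0 to 1 versus -1 to 1) can be illustrated with the logistic function and the hyperbolic tangent. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the sign-based branching is one common way to avoid overflow in `math.exp`, not the only one.

```python
import math


def logistic(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x)), with values between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logistic_derivative(x: float) -> float:
    """Derivative of the logistic function: s(x) * (1 - s(x))."""
    s = logistic(x)
    return s * (1.0 - s)


def tanh_via_logistic(x: float) -> float:
    """tanh is a rescaled logistic: tanh(x) = 2 * logistic(2x) - 1."""
    return 2.0 * logistic(2.0 * x) - 1.0


# Compare the two curves at a few sample points.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  logistic={logistic(x):.4f}  tanh={math.tanh(x):+.4f}")
```

The identity in `tanh_via_logistic` shows that the two functions have the same "S" shape and differ only by a shift and rescaling of the input and output.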
Artificial neural networks (NNs) have become the de facto standard in machine learning. They allow learning highly nonlinear transformations in a plethora of applications. However, NNs usually only provide point estimates without systematically quantifying the corresponding uncertainties. In this paper, a novel approach towards fully Bayesian NNs is proposed, where training and predictions of a perceptron are performed within the Bayesian inference framework in closed form. The weights and the predictions of the perceptron are considered Gaussian random variables. Analytical expressions for predicting the perceptron's output and for learning the weights are provided for commonly used activation functions like sigmoid or ReLU. This approach requires no computationally expensive gradient calculations and further allows sequential learning.
Neural networks are hard to train. The deeper they get, the more likely they are to suffer from unstable gradients. Gradients can either explode or vanish, and neither of those is a good thing for the training of our network. The vanishing gradients problem results in the network taking too long to train (learning will be very slow or stop completely), while exploding gradients produce very large weight updates that make training unstable. Although those problems are nearly inevitable, the choice of activation function can reduce their effects. Using ReLU activation in the first layers can help avoid vanishing gradients.
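A rough way to see why the activation function matters here: during backpropagation, the gradient reaching an early layer includes one activation-derivative factor per layer it passes through. The sketch below is a deliberately simplified illustration, not a real training loop. It assumes every layer sees the same pre-activation value and ignores the weights entirely (as if they were all 1), and the function names are hypothetical.

```python
import math
from typing import Callable


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid; its maximum is 0.25, reached at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_factor(grad_fn: Callable[[float], float],
                   pre_activation: float, depth: int) -> float:
    """Product of `depth` identical local derivatives (weights assumed to be 1)."""
    return grad_fn(pre_activation) ** depth


# Best case for the sigmoid (x = 0) versus an active ReLU unit (x = 1).
for depth in (1, 5, 10, 20):
    sig = chained_factor(sigmoid_grad, 0.0, depth)
    rel = chained_factor(relu_grad, 1.0, depth)
    print(f"depth={depth:2d}  sigmoid factor={sig:.3e}  relu factor={rel:.1f}")
```

Even in the sigmoid's best case the factor shrinks as 0.25 raised to the depth, while an active ReLU unit passes the gradient through unchanged. Real networks also multiply by the weights at each layer, which is what can push the product the other way and make gradients explode.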