I was experimenting with some CNN models and reading research material when I noticed that using only a single batch normalization layer in the early stages of the network can be more beneficial than placing a batch normalization layer after each convolutional layer (in the case of CNNs). The inspiration came from the paper "Comparison of feature learning methods for human activity recognition using wearable sensors" by F. Li, K. Shirahama, M. A. Nisar, L. Koping, and M. Grzegorzek. When and why does batch normalization hurt learning? Why might using a single batch normalization layer instead of many result in better learning?
Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies believe that the success of LayerNorm comes from forward normalization.
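To make the mechanism concrete: LayerNorm computes its statistics per example, across the feature dimension, rather than across the batch. This is a minimal NumPy sketch, not the paper's implementation; the function name, `eps` default, and shapes are my own choices:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: mean/variance per example, across features.

    x: (batch, features); gamma, beta: learnable per-feature scale/shift.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(4, 16))
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
# each row of y now has approximately zero mean and unit variance
```

Because the statistics are per example, the result does not depend on the batch size, which is why LayerNorm also works with batch size 1 and in recurrent networks.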
Online Normalization is a new technique for normalizing the hidden activations of a neural network. While Online Normalization does not use batches, it is as accurate as Batch Normalization. We resolve a theoretical limitation of Batch Normalization by introducing an unbiased technique for computing the gradient of normalized activations. Online Normalization works with automatic differentiation by adding statistical normalization as a primitive. This technique can be used in cases not covered by some other normalizers, such as recurrent networks, fully connected networks, and networks with activation memory requirements prohibitive for batching.
Batch normalization is the most widely used of these techniques and often works wonders for performance. It normalizes activations in a network across the mini-batch: it computes the mean and variance of each feature over the mini-batch, subtracts the mean, and divides the feature by its mini-batch standard deviation. It also has two learnable parameters, a scale (gamma) and a shift (beta), which let the network recover a desired mean and magnitude for the activations.
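The steps above can be sketched in a few lines of NumPy for a fully-connected layer. This is an illustrative training-mode forward pass only (no running statistics, no backward pass); the function name and shapes are my own choices:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch (training mode).

    x: (batch, features); gamma, beta: learnable per-feature scale/shift.
    """
    mean = x.mean(axis=0)           # per-feature mean over the batch
    var = x.var(axis=0)             # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta     # learnable rescale and shift

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# each feature column of y now has approximately zero mean and unit variance
```

Note that the output for one example depends on the other examples in the mini-batch, which is exactly why small batches make the estimated statistics noisy.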
Traditionally, multi-layer neural networks use the dot product between the output vector of the previous layer and the incoming weight vector as the input to the activation function. The result of the dot product is unbounded, which increases the risk of large variance. A large variance makes a neuron sensitive to changes in the input distribution, resulting in poor generalization, and aggravates internal covariate shift, which slows down training. To bound the dot product and decrease the variance, we propose to use cosine similarity or centered cosine similarity (the Pearson correlation coefficient) instead of the dot product in neural networks, which we call cosine normalization. We compare cosine normalization with batch, weight, and layer normalization in fully-connected neural networks as well as convolutional networks on the MNIST, 20NEWS GROUP, CIFAR-10/100, and SVHN data sets. Experiments show that cosine normalization achieves better performance than the other normalization techniques.
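The core idea can be shown directly: replace the unbounded pre-activation w·x with cos(θ) = w·x / (‖w‖‖x‖), which lies in [-1, 1]. This is a minimal NumPy sketch of that substitution, not the paper's code; the function name and the small epsilon guarding division by zero are my own choices:

```python
import numpy as np

def cosine_norm(x, w, eps=1e-8):
    """Cosine normalization: pre-activation = cos(theta) instead of w.x.

    x: (batch, in_features); w: (out_features, in_features).
    Returns (batch, out_features) values bounded in [-1, 1].
    """
    x_n = x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)
    w_n = w / (np.linalg.norm(w, axis=-1, keepdims=True) + eps)
    return x_n @ w_n.T

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))
w = rng.normal(size=(5, 10))
s = cosine_norm(x, w)
# s is bounded in [-1, 1] regardless of the scale of x or w
```

Because the output is bounded no matter how large the weights or inputs grow, the variance of the pre-activation stays controlled without computing any batch statistics.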