I was experimenting with some CNN models and reading research material when I realized that it could happen that using only a single batch normalization layer at the early stages of the network could be beneficial compared to using a batch normalization layer after each convolutional layer (in case of CNNs). The inspiration came from the paper Comparison of feature learning methods for human activity recognition using wearable sensors by F. Li, K. Shirahama, M. A. Nisar, L. Koping, and M. Grzegorzek. I was wondering when and why does batch normalization hurt learning? Why using a single batch normalization instead of many may result in better learning?
With all the success of BN, it is amazing and disappointing at the same time that there are so many fantastic results but so little practical advice, how to actually implement the whole pipeline. No doubt, BN can be implemented pretty easy in the training part of the network, but that is not the whole story. Furthermore, there are, at least, two ways to use BN during training. First, with a running average for mean/std values per layer which can later be used for unseen data. Second, to calculate the mean/std values for each mini-batch and then run a separate step to fix the statistics for the data at the end of the training.
A BSTRACT Batch Normalization (BN) is one of the most widely used techniques in Deep Learning field. This weakness limits the usage of BN on many computer vision tasks like detection or segmentation, where batch size is usually small due to the constraint of memory consumption. Therefore many modified normalization techniques have been proposed, which either fail to restore the performance of BN completely, or have to introduce additional nonlinear operations in inference procedure and increase huge consumption. In this paper, we reveal that there are two extra batch statistics involved in backward propagation of BN, on which has never been well discussed before. The extra batch statistics associated with gradients also can severely affect the training of deep neural network. Based on our analysis, we propose a novel normalization method, named Moving Average Batch Normalization (MABN). MABN can completely restore the performance of vanilla BN in small batch cases, without introducing any additional nonlinear operations in inference procedure. We prove the benefits of MABN by both theoretical analysis and experiments. Our experiments demonstrate the effectiveness of MABN in multiple computer vision tasks including ImageNet and COCO. It has been widely proven effective in many applications, and become the indispensable part of many state of the art deep models.
The Batch Normalization paper published back in 2015 by Sergey Ioffe, Christian Szegedy took the deep learning community by storm. It became one of the most implemented techniques in deep learning after it was released. Notably, its ability to accelerate training of deep learning models and achieve the same accuracy in 14 times fewer training steps was a great catch. Indeed that brought in the attention which it gets today (who doesn't want to train faster?). And there's been a lot of similar papers like layer normalization, instance normalization and a few others.