Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many of previous studies believe that the success of LayerNorm comes from forward normalization.
Differentially private stochastic gradient descent (DPSGD) is a variation of stochastic gradient descent based on the Differential Privacy (DP) paradigm which can mitigate privacy threats arising from the presence of sensitive information in training data. One major drawback of training deep neural networks with DPSGD is a reduction in the model's accuracy. In this paper, we propose an alternative method for preserving data privacy based on introducing noise through learnable probability distributions, which leads to a significant improvement in the utility of the resulting private models. We also demonstrate that normalization layers have a large beneficial impact on the performance of deep neural networks with noisy parameters. In particular, we show that contrary to general belief, a large amount of random noise can be added to the weights of neural networks without harming the performance, once the networks are augmented with normalization layers. We hypothesize that this robustness is a consequence of the scale invariance property of normalization operators. Building on these observations, we propose a new algorithmic technique for training deep neural networks under very low privacy budgets by sampling weights from Gaussian distributions and utilizing batch or layer normalization techniques to prevent performance degradation. Our method outperforms previous approaches, including DPSGD, by a substantial margin on a comprehensive set of experiments on Computer Vision and Natural Language Processing tasks. In particular, we obtain a 20 percent accuracy improvement over DPSGD on the MNIST and CIFAR10 datasets with DP-privacy budgets of $\varepsilon = 0.05$ and $\varepsilon = 2.0$, respectively. Our code is available online: https://github.com/uds-lsv/SIDP.
Neural networks with binary weights are computation-efficient and hardware-friendly, but their training is challenging because it involves a discrete optimization problem. Surprisingly, ignoring the discrete nature of the problem and using gradient-based methods, such as Straight-Through Estimator, still works well in practice. This raises the question: are there principled approaches which justify such methods? In this paper, we propose such an approach using the Bayesian learning rule. The rule, when applied to estimate a Bernoulli distribution over the binary weights, results in an algorithm which justifies some of the algorithmic choices made by the previous approaches. The algorithm not only obtains state-of-the-art performance, but also enables uncertainty estimation for continual learning to avoid catastrophic forgetting. Our work provides a principled approach for training binary neural networks which justifies and extends existing approaches.
The Batch Normalization paper published back in 2015 by Sergey Ioffe, Christian Szegedy took the deep learning community by storm. It became one of the most implemented techniques in deep learning after it was released. Notably, its ability to accelerate training of deep learning models and achieve the same accuracy in 14 times fewer training steps was a great catch. Indeed that brought in the attention which it gets today (who doesn't want to train faster?). And there's been a lot of similar papers like layer normalization, instance normalization and a few others.
The lack of transparency of neural networks stays a major break for their use. The Layerwise Relevance Propagation technique builds heat-maps representing the relevance of each input in the model s decision. The relevance spreads backward from the last to the first layer of the Deep Neural Network. Layer-wise Relevance Propagation does not manage normalization layers, in this work we suggest a method to include normalization layers. Specifically, we build an equivalent network fusing normalization layers and convolutional or fully connected layers. Heatmaps obtained with our method on MNIST and CIFAR 10 datasets are more accurate for convolutional layers. Our study also prevents from using Layerwise Relevance Propagation with networks including a combination of connected layers and normalization layer.