Goto

Collaborating Authors

 batch normalization layer






6e3197aae95c2ff8fcab35cb730f6a86-Paper.pdf

Neural Information Processing Systems

Compared withconvolutional neural networks (CNNs), the training ofAdderNets ismuch more sophisticated including several techniques for adjusting gradient and batch normalization.




Efficient Training of Low-Curvature Neural Networks

Neural Information Processing Systems

Standard deep neural networks often have excess non-linearity, making them susceptible to issues such as low adversarial robustness and gradient instability. Common methods to address these downstream issues, such as adversarial training, are expensive and often sacrifice predictive accuracy. In this work, we address the core issue of excess non-linearity via curvature, and demonstrate low-curvature neural networks (LCNNs) that obtain drastically lower curvature than standard models while exhibiting similar predictive performance. This leads to improved robustness and stable gradients, at a fraction of the cost of standard adversarial training. To achieve this, we decompose overall model curvature in terms of curvatures and slopes of its constituent layers. To enable efficient curvature minimization of constituent layers, we introduce two novel architectural components: first, a non-linearity called centered-softplus that is a stable variant of the softplus non-linearity, and second, a Lipschitz-constrained batch normalization layer. Our experiments show that LCNNs have lower curvature, more stable gradients and increased off-the-shelf adversarial robustness when compared to standard neural networks, all without affecting predictive performance. Our approach is easy to use and can be readily incorporated into existing neural network architectures. Code to implement our method and replicate our experiments is available at https://github.com/kylematoba/lcnn.



A Definition of a batch normalization layer When applying batch normalization to convolutional layers, the inputs and outputs of normalization layers are 4-dimensional tensors, which we denote by I

Neural Information Processing Systems

For distributed training, the batch statistics are usually estimated locally on a subset of the training minibatch ("ghost batch normalization" [ We now define the three models in full. These inputs first pass through a single fully connected linear layer of width 1000. We then apply a series of residual blocks. LeCun normal initialization [48] to preserve the variance in the absence of non-linearities. We then apply a series of residual blocks.