Gulcehre, Caglar, Moczulski, Marcin, Denil, Misha, Bengio, Yoshua

Common nonlinear activation functions used in neural networks can cause training difficulties due to the saturation behavior of the activation function, which may hide dependencies that are not visible to vanilla-SGD (using first order gradients only). Gating mechanisms that use softly saturating activation functions to emulate the discrete switching of digital logic circuits are good examples of this. We propose to exploit the injection of appropriate noise so that the gradients may flow easily, even if the noiseless application of the activation function would yield zero gradient. Large noise will dominate the noise-free gradient and allow stochastic gradient descent toexplore more. By adding noise only to the problematic parts of the activation function, we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing, when the amount of noise is annealed down, making it easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions by noisy variants helps training in many contexts, yielding state-of-the-art or competitive results on different datasets and task, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results.

Lengerich, Benjamin J., Konam, Sandeep, Xing, Eric P., Rosenthal, Stephanie, Veloso, Manuela

The predictive power of neural networks often costs model interpretability. Several techniques have been developed for explaining model outputs in terms of input features; however, it is difficult to translate such interpretations into actionable insight. Here, we propose a framework to analyze predictions in terms of the model's internal features by inspecting information flow through the network. Given a trained network and a test image, we select neurons by two metrics, both measured over a set of images created by perturbations to the input image: (1) magnitude of the correlation between the neuron activation and the network output and (2) precision of the neuron activation. We show that the former metric selects neurons that exert large influence over the network output while the latter metric selects neurons that activate on generalizable features. By comparing the sets of neurons selected by these two metrics, our framework suggests a way to investigate the internal attention mechanisms of convolutional neural networks.

Boopathy, Akhilan, Weng, Tsui-Wei, Chen, Pin-Yu, Liu, Sijia, Daniel, Luca

Verifying robustness of neural network classifiers has attracted great interests and attention due to the success of deep neural networks and their unexpected vulnerability to adversarial perturbations. Although finding minimum adversarial distortion of neural networks (with ReLU activations) has been shown to be an NP-complete problem, obtaining a non-trivial lower bound of minimum distortion as a provable robustness guarantee is possible. However, most previous works only focused on simple fully-connected layers (multilayer perceptrons) and were limited to ReLU activations. This motivates us to propose a general and efficient framework, CNN-Cert, that is capable of certifying robustness on general convolutional neural networks. Our framework is general -- we can handle various architectures including convolutional layers, max-pooling layers, batch normalization layer, residual blocks, as well as general activation functions; our approach is efficient -- by exploiting the special structure of convolutional layers, we achieve up to 17 and 11 times of speed-up compared to the state-of-the-art certification algorithms (e.g. Fast-Lin, CROWN) and 366 times of speed-up compared to the dual-LP approach while our algorithm obtains similar or even better verification bounds. In addition, CNN-Cert generalizes state-of-the-art algorithms e.g. Fast-Lin and CROWN. We demonstrate by extensive experiments that our method outperforms state-of-the-art lower-bound-based certification algorithms in terms of both bound quality and speed.

Blumenfeld, Yaniv, Gilboa, Dar, Soudry, Daniel

Reducing the precision of weights and activation functions in neural network training, with minimal impact on performance, is essential for the deployment of these models in resource-constrained environments. We apply mean-field techniques to networks with quantized activations in order to evaluate the degree to which quantization degrades signal propagation at initialization. We derive initialization schemes which maximize signal propagation in such networks and suggest why this is helpful for generalization. Building on these results, we obtain a closed form implicit equation for $L_{\max}$, the maximal trainable depth (and hence model capacity), given $N$, the number of quantization levels in the activation function. Solving this equation numerically, we obtain asymptotically: $L_{\max}\propto N^{1.82}$.

In past few years, linear rectified unit activation functions have shown its significance in the neural networks, surpassing the performance of sigmoid activations. RELU (Nair & Hinton, 2010), ELU (Clevert et al., 2015), PRELU (He et al., 2015), LRELU (Maas et al., 2013), SRELU (Jin et al., 2016), ThresholdedRELU, all these linear rectified activation functions have its own significance over others in some aspect. Most of the time these activation functions suffer from bias shift problem due to non-zero output mean, and high weight update problem in deep complex networks due to unit gradient, which results in slower training, and high variance in model prediction respectively. In this paper, we propose, "Thresholded exponential rectified linear unit" (TERELU) activation function that works better in alleviating in overfitting: large weight update problem. Along with alleviating overfitting problem, this method also gives good amount of non-linearity as compared to other linear rectifiers. We will show better performance on the various datasets using neural networks, considering TERELU activation method compared to other activations.