to

### Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

The permutation symmetry of neurons in each layer of a deep neural network gives rise not only to multiple equivalent global minima of the loss function, but also to first-order saddle points located on the path between the global minima. In a network of $d-1$ hidden layers with $n_k$ neurons in layers $k = 1, \ldots, d$, we construct smooth paths between equivalent global minima that lead through a permutation point' where the input and output weight vectors of two neurons in the same hidden layer $k$ collide and interchange. We show that such permutation points are critical points with at least $n_{k+1}$ vanishing eigenvalues of the Hessian matrix of second derivatives indicating a local plateau of the loss function. We find that a permutation point for the exchange of neurons $i$ and $j$ transits into a flat valley (or generally, an extended plateau of $n_{k+1}$ flat dimensions) that enables all $n_k!$ permutations of neurons in a given layer $k$ at the same loss value. Moreover, we introduce high-order permutation points by exploiting the recursive structure in neural network functions, and find that the number of $K^{\text{th}}$-order permutation points is at least by a factor $\sum_{k=1}^{d-1}\frac{1}{2!^K}{n_k-K \choose K}$ larger than the (already huge) number of equivalent global minima. In two tasks, we illustrate numerically that some of the permutation points correspond to first-order saddles (permutation saddles'): first, in a toy network with a single hidden layer on a function approximation task and, second, in a multilayer network on the MNIST task. Our geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.

### The Upper Bound on Knots in Neural Networks

Neural networks with rectified linear unit activations are essentially multivariate linear splines. As such, one of many ways to measure the "complexity" or "expressivity" of a neural network is to count the number of knots in the spline model. We study the number of knots in fully-connected feedforward neural networks with rectified linear unit activation functions. We intentionally keep the neural networks very simple, so as to make theoretical analyses more approachable. An induction on the number of layers $l$ reveals a tight upper bound on the number of knots in $\mathbb{R} \to \mathbb{R}^p$ deep neural networks. With $n_i \gg 1$ neurons in layer $i = 1, \dots, l$, the upper bound is approximately $n_1 \dots n_l$. We then show that the exact upper bound is tight, and we demonstrate the upper bound with an example. The purpose of these analyses is to pave a path for understanding the behavior of general $\mathbb{R}^q \to \mathbb{R}^p$ neural networks.

### Tangent Space Sensitivity and Distribution of Linear Regions in ReLU Networks

Recent articles indicate that deep neural networks are efficient models for various learning problems. However they are often highly sensitive to various changes that cannot be detected by an independent observer. As our understanding of deep neural networks with traditional generalization bounds still remains incomplete, there are several measures which capture the behaviour of the model in case of small changes at a specific state. In this paper we consider adversarial stability in the tangent space and suggest tangent sensitivity in order to characterize stability. We focus on a particular kind of stability with respect to changes in parameters that are induced by individual examples without known labels. We derive several easily computable bounds and empirical measures for feed-forward fully connected ReLU (Rectified Linear Unit) networks and connect tangent sensitivity to the distribution of the activation regions in the input space realized by the network. Our experiments suggest that even simple bounds and measures are associated with the empirical generalization gap.

### A simple and efficient architecture for trainable activation functions

Learning automatically the best activation function for the task is an active topic in neural network research. At the moment, despite promising results, it is still difficult to determine a method for learning an activation function that is at the same time theoretically simple and easy to implement. Moreover, most of the methods proposed so far introduce new parameters or adopt different learning techniques. In this work we propose a simple method to obtain trained activation function which adds to the neural network local subnetworks with a small amount of neurons. Experiments show that this approach could lead to better result with respect to using a pre-defined activation function, without introducing a large amount of extra parameters that need to be learned.

### How Neural Nets Work

How Neural Nets Work Alan Lapedes Robert Farber Theoretical Division Los Alamos National Laboratory Los Alamos, NM 87545 Abstract: There is presently great interest in the abilities of neural networks to mimic "qualitative reasoning" by manipulating neural incodings of symbols. Less work has been performed on using neural networks to process floating point numbers and it is sometimes stated that neural networks are somehow inherently inaccurate and therefore best suited for "fuzzy" qualitative reasoning. Nevertheless, the potential speed of massively parallel operations make neural net "number crunching" an interesting topic to explore. In this paper we discuss some of our work in which we demonstrate that for certain applications neural networks can achieve significantly higher numerical accuracy than more conventional techniques. In particular, prediction of future values of a chaotic time series can be performed with exceptionally high accuracy. We analyze how a neural net is able to do this, and in the process show that a large class of functions from Rn. Rffl may be accurately approximated by a backpropagation neural net with just two "hidden" layers. The network uses this functional approximation to perform either interpolation (signal processing applications) or extrapolation (symbol processing applicationsJ.