The weights of a neural network are typically initialized at random, and one can think of the functions produced by such a network as having been generated by a prior over some function space. Studying random networks, then, is useful for a Bayesian understanding of the network evolution in early stages of training. In particular, one can investigate why neural networks with huge numbers of parameters do not immediately overfit. We analyze the properties of random scalar-input feed-forward rectified linear unit architectures, which are random linear splines. With weights and biases sampled from certain common distributions, empirical tests show that the number of knots in the spline produced by the network is equal to the number of neurons, to very close approximation. We describe our progress towards a completely analytic explanation of this phenomenon. In particular, we show that random single-layer neural networks are equivalent to integrated random walks with variable step sizes. That each neuron produces one knot on average is equivalent to the associated integrated random walk having one zero crossing on average. We explore how properties of the integrated random walk, including the step sizes and initial conditions, affect the number of crossings. The number of knots in random neural networks can be related to the behavior of extreme learning machines, but it also establishes a prior preventing optimizers from immediately overfitting to noisy training data.
ReLU neural networks define piecewise linear functions of their inputs. However, initializing and training a neural network is very different from fitting a linear spline. In this paper, we expand empirically upon previous theoretical work to demonstrate features of trained neural networks. Standard network initialization and training produce networks vastly simpler than a naive parameter count would suggest and can impart odd features to the trained network. However, we also show the forced simplicity is beneficial and, indeed, critical for the wide success of these networks.
Artificial neural networks are often very complex and too deep for a human to understand. As a result, they are usually referred to as black boxes. For a lot of real-world problems, the underlying pattern itself is very complicated, such that an analytic solution does not exist. However, in some cases, laws of physics, for example, the pattern can be described by relatively simple mathematical expressions. In that case, we want to get a readable equation rather than a black box. In this paper, we try to find a way to explain a network and extract a human-readable equation that describes the model.
We propose to optimize the activation functions of a deep neural network by adding a corresponding functional regularization to the cost function. We justify the use of a second-order total-variation criterion. This allows us to derive a general representer theorem for deep neural networks that makes a direct connection with splines and sparsity. Specifically, we show that the optimal network configuration can be achieved with activation functions that are nonuniform linear splines with adaptive knots. The bottom line is that the action of each neuron is encoded by a spline whose parameters (including the number of knots) are optimized during the training procedure. The scheme results in a computational structure that is compatible with the existing deep-ReLU and MaxOut architectures. It also suggests novel optimization challenges, while making the link with $\ell_1$ minimization and sparsity-promoting techniques explicit.
We show how an "Elman" network architecture, constructed from recurrently connected oscillatory associative memory network modules, can employ selective "attentional" control of synchronization to direct the flow of communication and computation within the architecture to solve a grammatical inference problem. Previously we have shown how the discrete time "Elman" network algorithm can be implemented in a network completely described by continuous ordinary differential equations. The time steps (machine cycles) of the system are implemented by rhythmic variation (clocking) of a bifurcation parameter. In this architecture, oscillation amplitude codes the information content or activity of a module (unit), whereas phase and frequency are used to "softwire" the network. Only synchronized modules communicate by exchanging amplitude information; the activity of non-resonating modules contributes incoherent crosstalk noise. Attentional control is modeled as a special subset of the hidden modules with ouputs which affect the resonant frequencies of other hidden modules. They control synchrony among the other modules and direct the flow of computation (attention) to effect transitions between two subgraphs of a thirteen state automaton which the system emulates to generate a Reber grammar. The internal crosstalk noise is used to drive the required random transitions of the automaton.