Sorry if this is too trivial, but let me start at the very beginning: linear regression. The goal of (ordinary least-squares) linear regression is to find the optimal weights that, when linearly combined with the inputs, result in a model that minimizes the squared vertical offsets between the observed targets and the model's predictions. But let's not get distracted by model fitting, which is a different topic ;). So, in linear regression, we compute a linear combination of weights and inputs (let's call this function the "net input function"). Next, let's consider logistic regression. Here, we put the net input z through a non-linear "activation function": the logistic sigmoid function.
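To make the two steps concrete, here is a minimal sketch of the net input followed by the logistic sigmoid. The bias term, the weight values, and the input are assumptions made up for illustration; they are not taken from the text above.

```python
import math

def net_input(weights, bias, x):
    """Linear combination of weights and inputs: z = w . x + b."""
    return sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical weights, bias, and input, chosen only for illustration.
w, b = [0.5, -1.0], 0.25
x = [2.0, 1.0]

z = net_input(w, b, x)  # linear regression would output z directly
p = sigmoid(z)          # logistic regression squashes z into the interval (0, 1)
```

The only difference between the two models at prediction time is the last line: linear regression stops at z, while logistic regression passes z through the sigmoid.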

Sotoudeh, Matthew; Thakur, Aditya V.

A linear restriction of a function is the same function with its domain restricted to points on a given line. This paper addresses the problem of computing a succinct representation for a linear restriction of a piecewise-linear neural network. This primitive, which we call ExactLine, allows us to exactly characterize the result of applying the network to all of the infinitely many points on a line. In particular, ExactLine computes a partitioning of the given input line segment such that the network is affine on each partition. We present an efficient algorithm for computing ExactLine for networks that use ReLU, MaxPool, batch normalization, fully-connected, convolutional, and other layers, along with several applications.
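The partitioning idea can be sketched for the simplest case, a single fully-connected ReLU layer. This is not the authors' algorithm, only an illustration under stated assumptions: the function names, the weights `W`, the bias `c`, and the endpoints `a` and `b` are all hypothetical. Along the segment x(t) = a + t(b - a), each pre-activation is affine in t, so the layer can only stop being affine where some pre-activation crosses zero.

```python
def affine(W, c, x):
    """Pre-activations W x + c for a fully-connected layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + ci for row, ci in zip(W, c)]

def relu_layer(W, c, x):
    """Fully-connected layer followed by ReLU."""
    return [max(0.0, z) for z in affine(W, c, x)]

def point_on_line(a, b, t):
    """Point a + t * (b - a) on the segment from a to b."""
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

def relu_layer_breakpoints(W, c, a, b):
    """Sorted t values in [0, 1]; the ReLU layer is affine between neighbours."""
    z0 = affine(W, c, a)  # pre-activations at t = 0
    z1 = affine(W, c, b)  # pre-activations at t = 1
    ts = {0.0, 1.0}
    for s, e in zip(z0, z1):
        if s * e < 0:            # strict sign change: this unit switches on or off
            ts.add(s / (s - e))  # solve s + t * (e - s) = 0 for t
    return sorted(ts)

# Hypothetical layer and segment, chosen only for illustration.
W = [[1.0, 0.0], [0.0, 1.0]]
c = [0.0, 0.0]
a, b = [-1.0, -3.0], [1.0, 1.0]

ts = relu_layer_breakpoints(W, c, a, b)
```

For deeper networks one would have to repeat this on each sub-segment for the next layer, since the image of the segment is itself piecewise-linear; the sketch above deliberately stops at one layer.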

How can deep neural networks generalize, when they often have many more parameters than training samples? One explanation might be the training method, gradient descent, which is thought to be implicitly biased towards good local minima that generalize well. In order to gain a better understanding of the dynamics of neural networks trained by gradient descent, one can consider deep linear networks, which are a composition of linear transforms without any non-linearity in between [8, 6, 5, 2, 7, 3]. This gives an interesting theoretical framework in which there is hope of analysing the behavior of gradient descent precisely. In this work, we consider deep orthogonal linear networks, which are deep linear networks where each linear transform is constrained to be orthogonal. The set of orthogonal matrices is a Riemannian manifold, hence the training is performed with Riemannian gradient descent. We show that training any such network with Riemannian gradient descent is exactly equivalent to training a shallow one-layer neural network, hence fully explaining the role (or lack thereof) of depth in such models.
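As a small illustration of why depth adds no expressive power here, the sketch below checks that a stack of orthogonal linear layers computes the same map as a single orthogonal matrix. It only covers the forward map, not the training-dynamics equivalence claimed above, and the three 2x2 rotation layers and their angles are hypothetical values picked for the example.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rotation(theta):
    """2x2 rotation matrix, a simple example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Hypothetical 3-layer orthogonal linear network (arbitrary angles).
layers = [rotation(0.3), rotation(-1.1), rotation(2.0)]

# Collapse the layers into one matrix W = W3 W2 W1.
W = layers[0]
for layer in layers[1:]:
    W = matmul(layer, W)

x = [1.0, -2.0]
deep_out = x
for layer in layers:          # apply the layers one after another
    deep_out = matvec(layer, deep_out)
shallow_out = matvec(W, x)    # apply the collapsed single layer

gram = matmul(transpose(W), W)  # identity (up to rounding) if W is orthogonal
```

Comparing `deep_out` with `shallow_out`, and `gram` with the identity, confirms that the end-to-end map of the deep network is itself a single orthogonal transform.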