Collaborating Authors

Deep linear neural networks with arbitrary loss: All local minima are global Machine Learning

We consider deep linear networks with arbitrary differentiable loss. We provide a short and elementary proof of the following fact: all local minima are global minima if each hidden layer is wider than either the input or output layer.
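The key structural fact behind such results is that a deep linear network, having no nonlinearity, computes a single linear map: the composition of its layers collapses to one matrix product. A minimal NumPy sketch (an illustration of this collapse, not the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# A "deep linear network": three weight matrices, no nonlinearity.
# The hidden width (6) is larger than both the input (4) and output (3)
# widths, matching the width condition stated in the abstract.
W1 = rng.standard_normal((6, 4))
W2 = rng.standard_normal((6, 6))
W3 = rng.standard_normal((3, 6))

x = rng.standard_normal(4)

# Forward pass through the layers...
deep = W3 @ (W2 @ (W1 @ x))
# ...equals a single linear map with the collapsed matrix W3 W2 W1.
shallow = (W3 @ W2 @ W1) @ x

assert np.allclose(deep, shallow)
```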

What is the Role of the Activation Function in a Neural Network?


Sorry if this is too trivial, but let me start at the "very beginning": linear regression. The goal of (ordinary least-squares) linear regression is to find the weights that -- when linearly combined with the inputs -- minimize the vertical offsets between the predicted and target values, but let's not get distracted by model fitting, which is a different topic ;). So, in linear regression, we compute a linear combination of weights and inputs (let's call this function the "net input function"). Next, let's consider logistic regression. Here, we pass the net input z through a non-linear "activation function" -- the logistic sigmoid function σ(z) = 1 / (1 + e^(−z)).
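The two-step structure described above can be sketched in a few lines of NumPy (the variable names `net_input` and `sigmoid` are mine, chosen for illustration):

```python
import numpy as np

def net_input(w, x, b):
    # Linear combination of weights and inputs (the "net input function").
    return np.dot(w, x) + b

def sigmoid(z):
    # Logistic sigmoid: squashes the net input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3])
x = np.array([2.0, 1.0])
b = 0.1

z = net_input(w, x, b)  # linear regression outputs z directly
p = sigmoid(z)          # logistic regression outputs sigmoid(z)
```

Linear regression stops at `z`; logistic regression adds the nonlinear activation on top, which is what lets its output be read as a probability.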

Computing Linear Restrictions of Neural Networks

Neural Information Processing Systems

A linear restriction of a function is the same function with its domain restricted to points on a given line. This paper addresses the problem of computing a succinct representation for a linear restriction of a piecewise-linear neural network. This primitive, which we call ExactLine, allows us to exactly characterize the result of applying the network to all of the infinitely many points on a line. In particular, ExactLine computes a partitioning of the given input line segment such that the network is affine on each partition. We present an efficient algorithm for computing ExactLine for networks that use ReLU, MaxPool, batch normalization, fully-connected, convolutional, and other layers, along with several applications.
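The core idea can be sketched for a single hidden ReLU layer (a simplified stand-in for the paper's ExactLine algorithm, which handles deeper networks and more layer types): parameterize the segment as x(t) = p + t(q − p), find the values of t where each ReLU's pre-activation crosses zero, and those breakpoints partition [0, 1] into pieces on which the network is affine.

```python
import numpy as np

rng = np.random.default_rng(1)

# One hidden ReLU layer: f(x) = v @ relu(W @ x + b)
W = rng.standard_normal((5, 2))
b = rng.standard_normal(5)
v = rng.standard_normal(5)

p = np.array([-1.0, 0.0])   # endpoints of the input line segment
q = np.array([1.0, 1.0])
d = q - p

# Each ReLU unit flips exactly where its pre-activation crosses zero.
ts = []
for wi, bi in zip(W, b):
    slope = wi @ d
    if abs(slope) > 1e-12:
        t = -(wi @ p + bi) / slope
        if 0.0 < t < 1.0:
            ts.append(t)
endpoints = np.array(sorted([0.0] + ts + [1.0]))

def f(x):
    return v @ np.maximum(W @ x + b, 0.0)

# On each sub-segment the activation pattern is constant, so f is
# affine there: the value at the midpoint must equal the average of
# the values at the two endpoints.
for t0, t1 in zip(endpoints[:-1], endpoints[1:]):
    tm = 0.5 * (t0 + t1)
    left, right, mid = f(p + t0 * d), f(p + t1 * d), f(p + tm * d)
    assert np.isclose(mid, 0.5 * (left + right))
```

For deeper networks the same partitioning step is applied layer by layer, refining the breakpoints as each layer is traversed.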

Deep orthogonal linear networks are shallow Machine Learning

How can deep neural networks generalize, when they often have many more parameters than training samples? The culprit might be the training method, gradient descent, which should be implicitly biased towards good local minima that generalize well. In order to gain better understanding of the dynamics of neural networks trained by gradient descent, one can consider deep linear networks, which are a concatenation of linear transforms, without any non-linearity in between [8, 6, 5, 2, 7, 3]. This gives an interesting theoretical framework where there is hope to analyse precisely the behavior of gradient descent. In this work, we consider deep orthogonal linear networks, which are deep linear networks where each linear transform is constrained to be orthogonal. The set of orthogonal matrices is a Riemannian manifold, hence the training is performed with Riemannian gradient descent. We show that training any such network with Riemannian gradient descent is exactly equivalent to training a shallow one-layer neural network, hence fully explaining the role (or lack thereof) of depth in such models.
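The algebraic fact underlying the equivalence is that a product of orthogonal matrices is itself orthogonal, so the whole deep orthogonal network computes a single orthogonal transform. A small NumPy check (an illustration of this closure property, not the paper's Riemannian-gradient-descent analysis):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_orthogonal(n):
    # QR factorization of a Gaussian matrix yields an orthogonal Q.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

# A "deep" orthogonal linear network: a product of orthogonal layers.
layers = [random_orthogonal(4) for _ in range(5)]
W = np.linalg.multi_dot(layers)

# The product is itself orthogonal (W^T W = I), so the deep network
# computes a single orthogonal transform, like a shallow one.
assert np.allclose(W.T @ W, np.eye(4))
```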

Linear Regression using PyTorch


As we know, 'Data is the new oil.' Just like oil, data is valuable only if we know how to extract and use it; then it can solve many problems. Now, data can be explained by two things: a model and an error term. In this article, we are going to dive into the linear model.
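The model-plus-error decomposition can be sketched with a hand-rolled NumPy gradient-descent loop (a stand-in for the article's PyTorch code, which automates these gradient computations with autograd and torch.optim; the data and learning rate here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data from a known linear model plus noise (the "error").
true_w, true_b = 2.0, -1.0
x = rng.uniform(-1, 1, size=100)
y = true_w * x + true_b + 0.05 * rng.standard_normal(100)

# Fit w and b by gradient descent on the mean squared error -- the
# same loop PyTorch runs under the hood with nn.Linear and an optimizer.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * x + b
    grad_w = 2.0 * np.mean((pred - y) * x)
    grad_b = 2.0 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

# The fitted parameters recover the true model up to the noise level.
assert abs(w - true_w) < 0.1 and abs(b - true_b) < 0.1
```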