Brutzkus, Alon
How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
Buzaglo, Gon, Harel, Itamar, Nacson, Mor Shpigel, Brutzkus, Alon, Srebro, Nathan, Soudry, Daniel
Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such a sampled NN typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow ``teacher NN'' that agrees with the labels. Specifically, we show that such a `flat' prior over the NN parametrization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent -- enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's.
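As a rough, self-contained illustration of the sampling scheme described above (not the paper's experimental setup; the dimensions, the Gaussian stand-in for the `flat' prior, and the linear teacher are assumptions of this sketch), the following Python snippet rejection-samples a wider student network conditioned on interpolating labels produced by a narrow teacher, then measures its agreement with the teacher on fresh points:

```python
# Rejection-sample a random "student" network conditioned on interpolating
# labels produced by a narrow "teacher". All sizes and the Gaussian prior
# are illustrative assumptions, kept tiny so the rejection loop terminates.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                              # input dimension, number of training points

w_teacher = rng.standard_normal(d)       # narrow teacher: one linear threshold unit
X = rng.standard_normal((n, d))
y = np.sign(X @ w_teacher)

def predict(W1, w2, inputs):
    """Wider one-hidden-layer ReLU student, sign output."""
    return np.sign(np.maximum(inputs @ W1, 0.0) @ w2)

width = 16                               # student is over-parameterized for this task
while True:
    W1 = rng.standard_normal((d, width)) # sample from the (stand-in) prior ...
    w2 = rng.standard_normal(width)
    if np.array_equal(predict(W1, w2, X), y):   # ... conditioned on interpolation
        break

# How often does the random interpolator agree with the teacher on fresh points?
X_test = rng.standard_normal((2000, d))
agreement = np.mean(predict(W1, w2, X_test) == np.sign(X_test @ w_teacher))
print(f"test agreement with the teacher: {agreement:.2f}")
```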
On the Inductive Bias of a CNN for Orthogonal Patterns Distributions
Brutzkus, Alon, Globerson, Amir
Training overparameterized convolutional neural networks with gradient-based methods is the most successful learning method for image classification. However, its theoretical properties are far from understood even for very simple learning tasks. In this work, we consider a simplified image classification task where images contain orthogonal patches and are learned with a 3-layer overparameterized convolutional network and stochastic gradient descent. We empirically identify a novel phenomenon where the dot-products between the learned pattern detectors and their detected patterns are governed by the pattern statistics in the training set. We call this phenomenon Pattern Statistics Inductive Bias (PSI) and prove that PSI holds for a simple setup with two points in the training set. Furthermore, we prove that if PSI holds, stochastic gradient descent has sample complexity $O(d^2\log(d))$, where $d$ is the filter dimension. In contrast, we show a VC-dimension lower bound in our setting which is exponential in $d$. Taken together, our results provide strong evidence that PSI is a unique inductive bias of stochastic gradient descent that guarantees good generalization properties.
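For intuition only, here is a toy Python sketch of this kind of setup (the data-generating rule, architecture details, loss, and all sizes are illustrative assumptions, not the paper's exact task): images are built from orthonormal patterns, an over-parameterized patch-wise network is trained with SGD, and the filter-pattern dot-products behind the PSI phenomenon are then inspected:

```python
# Toy setup: patches are orthonormal patterns, the label says whether a
# designated "positive" pattern appears, and an over-parameterized
# patch-wise ReLU network (conv -> ReLU -> max-pool -> linear) is trained
# with SGD. Everything here is an illustrative assumption.
import torch

torch.manual_seed(0)
d, patches, channels, n = 8, 4, 32, 200   # filter dim, patches per image, filters, samples

patterns = torch.linalg.qr(torch.randn(d, d)).Q.T   # d orthonormal patterns (rows)
idx = torch.randint(0, d, (n, patches))             # which pattern sits in each patch
X = patterns[idx]                                   # shape (n, patches, d)
y = (idx == 0).any(dim=1).float() * 2 - 1           # +1 iff pattern 0 appears somewhere

W = (0.1 * torch.randn(channels, d)).requires_grad_()   # pattern detectors (filters)
v = (0.1 * torch.randn(channels)).requires_grad_()

def net(x):
    acts = torch.relu(x @ W.T)            # dot each filter with each patch
    return acts.max(dim=1).values @ v     # max-pool over patches, then linear readout

opt = torch.optim.SGD([W, v], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = torch.clamp(1 - y * net(X), min=0).mean()   # hinge loss
    loss.backward()
    opt.step()

# The PSI-style diagnostic: dot-products between learned filters and the patterns.
print(W.detach() @ patterns.T)
```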
On the Optimality of Trees Generated by ID3
Brutzkus, Alon, Daniely, Amit, Malach, Eran
Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we analyze the heuristic of growing a decision tree with ID3 for a limited number of iterations $t$, assuming nodes are split using exact information gain and probability computations. In several settings, we provide theoretical and empirical evidence that the TopDown variant of ID3, introduced by Kearns and Mansour (1996), produces trees with optimal or near-optimal test error among all trees with $t$ internal nodes. We prove optimality in the case of learning conjunctions under product distributions and learning read-once DNFs with 2 terms under the uniform distribution. Using efficient dynamic programming algorithms, we empirically show that TopDown generates trees that are near-optimal ($\sim 1\%$ difference from the optimal test error) in a large number of settings for learning read-once DNFs under product distributions.
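The following minimal Python sketch illustrates this style of top-down growth with a budget of $t$ internal nodes; it uses empirical information gain on binary features and a simple leaf-probability weighting, rather than the exact gain/probability computations or the precise Kearns-Mansour potential:

```python
# ID3-style top-down growth with a budget of t internal nodes, using
# empirical information gain on binary features. This is a simplified
# stand-in for the TopDown variant, not its exact potential function.
import numpy as np

def entropy(y):
    p = y.mean() if len(y) else 0.0
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def grow(X, y, t):
    """Greedily split the (leaf, feature) pair with largest weighted gain, t times."""
    leaves = [np.arange(len(y))]
    for _ in range(t):
        best = None                                   # (weighted gain, leaf index, feature)
        for li, rows in enumerate(leaves):
            if len(set(y[rows])) < 2:                 # leaf is already pure
                continue
            for j in range(X.shape[1]):
                mask = X[rows, j] == 1
                gain = entropy(y[rows]) - (mask.mean() * entropy(y[rows][mask])
                                           + (1 - mask.mean()) * entropy(y[rows][~mask]))
                gain *= len(rows) / len(y)            # weight by the leaf's probability mass
                if best is None or gain > best[0]:
                    best = (gain, li, j)
        if best is None:
            break
        _, li, j = best
        rows = leaves.pop(li)
        children = [rows[X[rows, j] == 1], rows[X[rows, j] == 0]]
        leaves += [c for c in children if len(c)]
    return [(rows, int(round(float(y[rows].mean())))) for rows in leaves]

# Example: grow 3 internal nodes for the conjunction x0 AND x1 on random binary data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))
y = (X[:, 0] & X[:, 1]).astype(float)
for rows, label in grow(X, y, t=3):
    print(f"leaf with {len(rows)} points, majority label {label}")
```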
ID3 Learns Juntas for Smoothed Product Distributions
Brutzkus, Alon, Daniely, Amit, Malach, Eran
In recent years, there have been many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm when the target function is a $k$-Junta, a function that depends on $k$ out of the $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns $k$-Juntas in polynomial time, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.
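As a hedged illustration of the learning setup (the perturbation magnitude, dimensions, and truth-table construction are assumptions of this sketch), the snippet below builds a $k$-Junta target over $n$ binary variables and samples it under a smoothed product distribution whose per-coordinate biases are random perturbations of $1/2$; such samples could, for instance, be fed to an ID3-style grower like the one sketched earlier to check whether the relevant coordinates are recovered:

```python
# A k-Junta target over n binary variables, sampled under a smoothed
# product distribution: each coordinate's bias is a random perturbation
# of 1/2. The perturbation range and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 20, 4, 2000                             # ambient dimension, junta size, sample size

relevant = rng.choice(n, size=k, replace=False)   # the k relevant coordinates
truth_table = rng.integers(0, 2, size=2 ** k)     # an arbitrary function of those bits

def junta(X):
    codes = X[:, relevant] @ (2 ** np.arange(k))  # encode the relevant bits as an index
    return truth_table[codes]

p = 0.5 + rng.uniform(-0.2, 0.2, size=n)          # smoothed per-coordinate biases
X = (rng.random((m, n)) < p).astype(int)
y = junta(X).astype(float)
print("relevant coordinates:", sorted(relevant.tolist()), "| label mean:", y.mean())
```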
Low Latency Privacy Preserving Inference
Brutzkus, Alon, Elisha, Oren, Gilad-Bachrach, Ran
When applying machine learning to sensitive data, one has to balance accuracy, information leakage, and computational complexity. Recent studies have shown that Homomorphic Encryption (HE) can be used to protect against information leakage while applying neural networks. However, this comes at the cost of limiting the width and depth of neural networks that can be used (and hence the accuracy), and with latency on the order of several minutes even for relatively simple networks. In this study we provide two solutions that address these limitations. In the first solution, we present a more than $10\times$ improvement in latency and enable inference on wider networks compared to prior attempts with the same level of security. The improved performance is achieved via a collection of methods to better represent the data during the computation. In the second solution, we apply the method of transfer learning to provide private inference services using deep networks with latency of less than 0.2 seconds. We demonstrate the efficacy of our methods on several computer vision tasks.
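Purely to show the data flow of the second, transfer-learning solution (this is a structural sketch, not the paper's implementation: the feature extractor is a random stand-in, and encrypt/decrypt are hypothetical identity placeholders rather than a real HE scheme), consider:

```python
# Structural sketch of the transfer-learning split: the client extracts deep
# features in the clear with a public network, and only a small linear head is
# evaluated on encrypted features. The feature extractor is a random stand-in,
# and encrypt/decrypt are hypothetical identity placeholders, not a real HE API.
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.standard_normal((64, 128))       # stand-in "pretrained" feature extractor
W_head = rng.standard_normal((128, 10))       # server's private linear classifier
b_head = rng.standard_normal(10)

def client_features(image):
    return np.tanh(image @ W_feat)            # run by the client on plaintext data

def encrypt(x):                               # placeholder: would produce a ciphertext
    return x

def decrypt(x):                               # placeholder: would recover the plaintext
    return x

def private_inference(enc_features):
    # HE schemes handle additions and plaintext multiplications well, so an
    # affine layer is HE-friendly; nonlinearities would need approximation.
    return enc_features @ W_head + b_head

image = rng.standard_normal((1, 64))
scores = decrypt(private_inference(encrypt(client_features(image))))
print("predicted class:", int(scores.argmax()))
```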
Over-parameterization Improves Generalization in the XOR Detection Problem
Brutzkus, Alon, Globerson, Amir
Most successful deep learning models use a number of parameters that is larger than the number of parameters needed to achieve zero training error. This is typically referred to as over-parameterization. Indeed, it can be argued that over-parameterization is one of the key techniques that have led to the remarkable success of neural networks. However, there is still no theoretical account for its effectiveness. One very intriguing observation in this context is that over-parameterized networks with ReLU activations often exhibit lower generalization error than smaller networks (Neyshabur et al., 2014, 2018; Novak et al., 2018). This somewhat counterintuitive observation suggests that over-parameterized networks have an inductive bias towards solutions with better generalization performance. Understanding this inductive bias is a major theoretical challenge and a necessary step towards a full understanding of neural networks in practice. To better understand this phenomenon, it is crucial to be able to reason about the optimization and generalization properties of over-parameterized networks with ReLU activations trained with gradient-based methods. However, in the current state of affairs, these are far from understood. In fact, there do not exist optimization or generalization guarantees for these networks even in very simple learning tasks such as the classic XOR problem.
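As a small, hedged illustration of the classic XOR task mentioned above (this is the textbook four-point XOR problem, not the paper's XOR detection setting, and the width, loss, and learning rate are arbitrary choices), the following snippet fits it with an over-parameterized one-hidden-layer ReLU network and gradient descent:

```python
# The textbook four-point XOR problem, fit with an over-parameterized
# one-hidden-layer ReLU network. Width, loss, and learning rate are arbitrary.
import torch

torch.manual_seed(0)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([-1., 1., 1., -1.])

width = 50                                    # far more units than XOR needs
model = torch.nn.Sequential(
    torch.nn.Linear(2, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    out = model(X).squeeze(-1)
    loss = torch.clamp(1 - y * out, min=0).mean()   # hinge loss, driven to zero
    loss.backward()
    opt.step()

print(torch.sign(model(X)).squeeze(-1))       # should recover -1, 1, 1, -1
```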
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
Brutzkus, Alon, Globerson, Amir
Deep learning models are often successfully trained using gradient descent, despite the worst-case hardness of the underlying non-convex optimization problem. The key question is then under what conditions one can prove that optimization will succeed. Here we provide a strong result of this kind. We consider a neural net with one hidden layer, a convolutional structure with no overlap, and a ReLU activation function. For this architecture we show that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time. To the best of our knowledge, this is the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations.
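A minimal numerical sketch of this architecture (dimensions, sample size, step size, and the planted-filter construction are assumptions made here; the paper analyzes the population loss, while this uses a finite-sample stand-in): a no-overlap network $f(x; w) = \frac{1}{k}\sum_i \max(0, w \cdot x_i)$ over disjoint Gaussian patches, trained with plain gradient descent on the squared loss:

```python
# No-overlap "convolutional" net f(x; w) = mean_i ReLU(w . x_i) over disjoint
# Gaussian patches, labels from a planted filter w*, plain gradient descent on
# the empirical squared loss. Sizes and step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 4, 20000                    # patch dimension, number of patches, samples

w_star = rng.standard_normal(d)          # planted ("teacher") filter
X = rng.standard_normal((n, k, d))       # each sample has k disjoint Gaussian patches
y = np.maximum(X @ w_star, 0.0).mean(axis=1)

def loss_and_grad(w):
    pre = X @ w                                      # (n, k) pre-activations
    err = np.maximum(pre, 0.0).mean(axis=1) - y      # (n,) residuals
    grad = np.einsum('n,nk,nkd->d', err, (pre > 0).astype(float), X) / (n * k)
    return 0.5 * np.mean(err ** 2), grad

w = 0.1 * rng.standard_normal(d)         # small random initialization
for _ in range(500):
    _, g = loss_and_grad(w)
    w -= 1.0 * g                         # fixed step size
final_loss, _ = loss_and_grad(w)
print(f"final loss {final_loss:.2e}, distance to w*: {np.linalg.norm(w - w_star):.2e}")
```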