Brutzkus, Alon
How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
Buzaglo, Gon, Harel, Itamar, Nacson, Mor Shpigel, Brutzkus, Alon, Srebro, Nathan, Soudry, Daniel
Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such a sampled NN typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow ``teacher NN'' that agrees with the labels. Specifically, we show that such a `flat' prior over the NN parametrization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent -- enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's.
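As a rough, self-contained illustration of the sampling scheme described above (not the paper's experimental setup; the dimensions, the Gaussian stand-in for the `flat' prior, and the linear teacher are assumptions of this sketch), the following Python snippet rejection-samples a wider student network conditioned on interpolating labels produced by a narrow teacher, then measures its agreement with the teacher on fresh points:

```python
# Rejection-sample a random "student" network conditioned on interpolating
# labels produced by a narrow "teacher". All sizes and the Gaussian prior
# are illustrative assumptions, kept tiny so the rejection loop terminates.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                              # input dimension, number of training points

w_teacher = rng.standard_normal(d)       # narrow teacher: one linear threshold unit
X = rng.standard_normal((n, d))
y = np.sign(X @ w_teacher)

def predict(W1, w2, inputs):
    """Wider one-hidden-layer ReLU student, sign output."""
    return np.sign(np.maximum(inputs @ W1, 0.0) @ w2)

width = 16                               # student is over-parameterized for this task
while True:
    W1 = rng.standard_normal((d, width)) # sample from the (stand-in) prior ...
    w2 = rng.standard_normal(width)
    if np.array_equal(predict(W1, w2, X), y):   # ... conditioned on interpolation
        break

# How often does the random interpolator agree with the teacher on fresh points?
X_test = rng.standard_normal((2000, d))
agreement = np.mean(predict(W1, w2, X_test) == np.sign(X_test @ w_teacher))
print(f"test agreement with the teacher: {agreement:.2f}")
```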
On the Inductive Bias of a CNN for Orthogonal Patterns Distributions
Brutzkus, Alon, Globerson, Amir
Training overparameterized convolutional neural networks with gradient-based methods is the most successful learning method for image classification. However, its theoretical properties are far from understood even for very simple learning tasks. In this work, we consider a simplified image classification task where images contain orthogonal patches and are learned with a 3-layer overparameterized convolutional network and stochastic gradient descent. We empirically identify a novel phenomenon where the dot-products between the learned pattern detectors and their detected patterns are governed by the pattern statistics in the training set. We call this phenomenon Pattern Statistics Inductive Bias (PSI) and prove that PSI holds for a simple setup with two points in the training set. Furthermore, we prove that if PSI holds, stochastic gradient descent has sample complexity $O(d^2\log(d))$, where $d$ is the filter dimension. In contrast, we show a VC-dimension lower bound in our setting which is exponential in $d$. Taken together, our results provide strong evidence that PSI is a unique inductive bias of stochastic gradient descent that guarantees good generalization properties.
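For intuition only, here is a toy Python sketch of this kind of setup (the data-generating rule, architecture details, loss, and all sizes are illustrative assumptions, not the paper's exact task): images are built from orthonormal patterns, an over-parameterized patch-wise network is trained with SGD, and the filter-pattern dot-products behind the PSI phenomenon are then inspected:

```python
# Toy setup: patches are orthonormal patterns, the label says whether a
# designated "positive" pattern appears, and an over-parameterized
# patch-wise ReLU network (conv -> ReLU -> max-pool -> linear) is trained
# with SGD. Everything here is an illustrative assumption.
import torch

torch.manual_seed(0)
d, patches, channels, n = 8, 4, 32, 200   # filter dim, patches per image, filters, samples

patterns = torch.linalg.qr(torch.randn(d, d)).Q.T   # d orthonormal patterns (rows)
idx = torch.randint(0, d, (n, patches))             # which pattern sits in each patch
X = patterns[idx]                                   # shape (n, patches, d)
y = (idx == 0).any(dim=1).float() * 2 - 1           # +1 iff pattern 0 appears somewhere

W = (0.1 * torch.randn(channels, d)).requires_grad_()   # pattern detectors (filters)
v = (0.1 * torch.randn(channels)).requires_grad_()

def net(x):
    acts = torch.relu(x @ W.T)            # dot each filter with each patch
    return acts.max(dim=1).values @ v     # max-pool over patches, then linear readout

opt = torch.optim.SGD([W, v], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = torch.clamp(1 - y * net(X), min=0).mean()   # hinge loss
    loss.backward()
    opt.step()

# The PSI-style diagnostic: dot-products between learned filters and the patterns.
print(W.detach() @ patterns.T)
```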
On the Optimality of Trees Generated by ID3
Brutzkus, Alon, Daniely, Amit, Malach, Eran
Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we analyze the heuristic of growing a decision tree with ID3 for a limited number of iterations $t$, assuming nodes are split using exact information gain and probability computations. In several settings, we provide theoretical and empirical evidence that the TopDown variant of ID3, introduced by Kearns and Mansour (1996), produces trees with optimal or near-optimal test error among all trees with $t$ internal nodes. We prove optimality in the case of learning conjunctions under product distributions and learning read-once DNFs with 2 terms under the uniform distribution. Using efficient dynamic programming algorithms, we empirically show that TopDown generates trees that are near-optimal ($\sim 1\%$ difference from the optimal test error) in a large number of settings for learning read-once DNFs under product distributions.
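The following minimal Python sketch illustrates this style of top-down growth with a budget of $t$ internal nodes; it uses empirical information gain on binary features and a simple leaf-probability weighting, rather than the exact gain/probability computations or the precise Kearns-Mansour potential:

```python
# ID3-style top-down growth with a budget of t internal nodes, using
# empirical information gain on binary features. This is a simplified
# stand-in for the TopDown variant, not its exact potential function.
import numpy as np

def entropy(y):
    p = y.mean() if len(y) else 0.0
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def grow(X, y, t):
    """Greedily split the (leaf, feature) pair with largest weighted gain, t times."""
    leaves = [np.arange(len(y))]
    for _ in range(t):
        best = None                                   # (weighted gain, leaf index, feature)
        for li, rows in enumerate(leaves):
            if len(set(y[rows])) < 2:                 # leaf is already pure
                continue
            for j in range(X.shape[1]):
                mask = X[rows, j] == 1
                gain = entropy(y[rows]) - (mask.mean() * entropy(y[rows][mask])
                                           + (1 - mask.mean()) * entropy(y[rows][~mask]))
                gain *= len(rows) / len(y)            # weight by the leaf's probability mass
                if best is None or gain > best[0]:
                    best = (gain, li, j)
        if best is None:
            break
        _, li, j = best
        rows = leaves.pop(li)
        children = [rows[X[rows, j] == 1], rows[X[rows, j] == 0]]
        leaves += [c for c in children if len(c)]
    return [(rows, int(round(float(y[rows].mean())))) for rows in leaves]

# Example: grow 3 internal nodes for the conjunction x0 AND x1 on random binary data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))
y = (X[:, 0] & X[:, 1]).astype(float)
for rows, label in grow(X, y, t=3):
    print(f"leaf with {len(rows)} points, majority label {label}")
```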
ID3 Learns Juntas for Smoothed Product Distributions
Brutzkus, Alon, Daniely, Amit, Malach, Eran
In recent years, there have been many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm when the target function is a $k$-Junta, a function that depends on $k$ out of the $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns $k$-Juntas in polynomial time, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.
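As a hedged illustration of the learning setup (the perturbation magnitude, dimensions, and truth-table construction are assumptions of this sketch), the snippet below builds a $k$-Junta target over $n$ binary variables and samples it under a smoothed product distribution whose per-coordinate biases are random perturbations of $1/2$; such samples could, for instance, be fed to an ID3-style grower like the one sketched earlier to check whether the relevant coordinates are recovered:

```python
# A k-Junta target over n binary variables, sampled under a smoothed
# product distribution: each coordinate's bias is a random perturbation
# of 1/2. The perturbation range and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 20, 4, 2000                             # ambient dimension, junta size, sample size

relevant = rng.choice(n, size=k, replace=False)   # the k relevant coordinates
truth_table = rng.integers(0, 2, size=2 ** k)     # an arbitrary function of those bits

def junta(X):
    codes = X[:, relevant] @ (2 ** np.arange(k))  # encode the relevant bits as an index
    return truth_table[codes]

p = 0.5 + rng.uniform(-0.2, 0.2, size=n)          # smoothed per-coordinate biases
X = (rng.random((m, n)) < p).astype(int)
y = junta(X).astype(float)
print("relevant coordinates:", sorted(relevant.tolist()), "| label mean:", y.mean())
```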
Low Latency Privacy Preserving Inference
Brutzkus, Alon, Elisha, Oren, Gilad-Bachrach, Ran
When applying machine learning to sensitive data, one has to balance accuracy, information leakage, and computational complexity. Recent studies have shown that Homomorphic Encryption (HE) can be used to protect against information leakage while applying neural networks. However, this comes at the cost of limiting the width and depth of neural networks that can be used (and hence the accuracy), and with latency on the order of several minutes even for relatively simple networks. In this study we provide two solutions that address these limitations. In the first solution, we present a more than $10\times$ improvement in latency and enable inference on wider networks compared to prior attempts with the same level of security. The improved performance is achieved via a collection of methods to better represent the data during the computation. In the second solution, we apply the method of transfer learning to provide private inference services using deep networks with latency of less than 0.2 seconds. We demonstrate the efficacy of our methods on several computer vision tasks.
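Purely to show the data flow of the second, transfer-learning solution (this is a structural sketch, not the paper's implementation: the feature extractor is a random stand-in, and encrypt/decrypt are hypothetical identity placeholders rather than a real HE scheme), consider:

```python
# Structural sketch of the transfer-learning split: the client extracts deep
# features in the clear with a public network, and only a small linear head is
# evaluated on encrypted features. The feature extractor is a random stand-in,
# and encrypt/decrypt are hypothetical identity placeholders, not a real HE API.
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.standard_normal((64, 128))       # stand-in "pretrained" feature extractor
W_head = rng.standard_normal((128, 10))       # server's private linear classifier
b_head = rng.standard_normal(10)

def client_features(image):
    return np.tanh(image @ W_feat)            # run by the client on plaintext data

def encrypt(x):                               # placeholder: would produce a ciphertext
    return x

def decrypt(x):                               # placeholder: would recover the plaintext
    return x

def private_inference(enc_features):
    # HE schemes handle additions and plaintext multiplications well, so an
    # affine layer is HE-friendly; nonlinearities would need approximation.
    return enc_features @ W_head + b_head

image = rng.standard_normal((1, 64))
scores = decrypt(private_inference(encrypt(client_features(image))))
print("predicted class:", int(scores.argmax()))
```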
Over-parameterization Improves Generalization in the XOR Detection Problem
Brutzkus, Alon, Globerson, Amir
Most successful deep learning models use a number of parameters that is larger than the number of parameters needed to achieve zero training error. This is typically referred to as over-parameterization. Indeed, it can be argued that over-parameterization is one of the key techniques that have led to the remarkable success of neural networks. However, there is still no theoretical account for its effectiveness. One very intriguing observation in this context is that over-parameterized networks with ReLU activations often exhibit lower generalization error than smaller networks (Neyshabur et al., 2014, 2018; Novak et al., 2018). This somewhat counterintuitive observation suggests that over-parameterized networks have an inductive bias towards solutions with better generalization performance. Understanding this inductive bias is a major theoretical challenge and a necessary step towards a full understanding of neural networks in practice. To better understand this phenomenon, it is crucial to be able to reason about the optimization and generalization properties of over-parameterized networks with ReLU activations trained with gradient-based methods. However, in the current state of affairs, these are far from understood. In fact, there do not exist optimization or generalization guarantees for these networks even in very simple learning tasks such as the classic XOR problem.
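As a small, hedged illustration of the classic XOR task mentioned above (this is the textbook four-point XOR problem, not the paper's XOR detection setting, and the width, loss, and learning rate are arbitrary choices), the following snippet fits it with an over-parameterized one-hidden-layer ReLU network and gradient descent:

```python
# The textbook four-point XOR problem, fit with an over-parameterized
# one-hidden-layer ReLU network. Width, loss, and learning rate are arbitrary.
import torch

torch.manual_seed(0)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([-1., 1., 1., -1.])

width = 50                                    # far more units than XOR needs
model = torch.nn.Sequential(
    torch.nn.Linear(2, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    out = model(X).squeeze(-1)
    loss = torch.clamp(1 - y * out, min=0).mean()   # hinge loss, driven to zero
    loss.backward()
    opt.step()

print(torch.sign(model(X)).squeeze(-1))       # should recover -1, 1, 1, -1
```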
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
Brutzkus, Alon, Globerson, Amir
Deep learning models are often successfully trained using gradient descent, despite the worst-case hardness of the underlying non-convex optimization problem. The key question is then under what conditions one can prove that optimization will succeed. Here we provide a strong result of this kind. We consider a neural net with one hidden layer, a convolutional structure with no overlap, and a ReLU activation function. For this architecture we show that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time. To the best of our knowledge, this is the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations.
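A minimal numerical sketch of this architecture (dimensions, sample size, step size, and the planted-filter construction are assumptions made here; the paper analyzes the population loss, while this uses a finite-sample stand-in): a no-overlap network $f(x; w) = \frac{1}{k}\sum_i \max(0, w \cdot x_i)$ over disjoint Gaussian patches, trained with plain gradient descent on the squared loss:

```python
# No-overlap "convolutional" net f(x; w) = mean_i ReLU(w . x_i) over disjoint
# Gaussian patches, labels from a planted filter w*, plain gradient descent on
# the empirical squared loss. Sizes and step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 4, 20000                    # patch dimension, number of patches, samples

w_star = rng.standard_normal(d)          # planted ("teacher") filter
X = rng.standard_normal((n, k, d))       # each sample has k disjoint Gaussian patches
y = np.maximum(X @ w_star, 0.0).mean(axis=1)

def loss_and_grad(w):
    pre = X @ w                                      # (n, k) pre-activations
    err = np.maximum(pre, 0.0).mean(axis=1) - y      # (n,) residuals
    grad = np.einsum('n,nk,nkd->d', err, (pre > 0).astype(float), X) / (n * k)
    return 0.5 * np.mean(err ** 2), grad

w = 0.1 * rng.standard_normal(d)         # small random initialization
for _ in range(500):
    _, g = loss_and_grad(w)
    w -= 1.0 * g                         # fixed step size
final_loss, _ = loss_and_grad(w)
print(f"final loss {final_loss:.2e}, distance to w*: {np.linalg.norm(w - w_star):.2e}")
```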