Learning Boolean Circuits with Neural Networks
Malach, Eran, Shalev-Shwartz, Shai
Training neural-networks is computationally hard. However, in practice they are trained efficiently using gradient-based algorithms, achieving remarkable performance on natural data. To bridge this gap, we observe the property of local correlation: correlation between small patterns of the input and the target label. We focus on learning deep neural-networks with a variant of gradient-descent, when the target function is a tree-structured Boolean circuit. We show that in this case, the existence of correlation between the gates of the circuit and the target label determines whether the optimization succeeds or fails. Using this result, we show that neural-networks can learn the $(\log n)$-parity problem for most product distributions. These results hint that local correlation may play an important role in differentiating between distributions that are hard or easy to learn.
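To make "local correlation" concrete, the following minimal sketch (ours, not from the paper; the particular biased distribution and all names are assumptions) estimates the correlation between each single input bit and a $(\log n)$-parity label under a non-uniform product distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 16, 4, 200_000                  # input bits, parity size, sample size
S = rng.choice(n, size=k, replace=False)  # hidden subset defining the parity

# A non-uniform product distribution: P(x_i = +1) = p_i.
p = rng.uniform(0.7, 0.9, size=n)
X = np.where(rng.random((m, n)) < p, 1.0, -1.0)
y = X[:, S].prod(axis=1)                  # parity target in {-1, +1}

# "Local correlation": empirical covariance of each single bit with the label.
corr = (X * y[:, None]).mean(axis=0) - X.mean(axis=0) * y.mean()
print("relevant bits :", np.sort(S))
print("top correlated:", np.sort(np.argsort(np.abs(corr))[-k:]))
```

Under the uniform distribution ($p_i = 1/2$ for all $i$) these single-bit correlations vanish exactly, which is consistent with parities being hard to learn there; for most biased product distributions they do not, and the relevant bits stand out.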
On the Optimality of Trees Generated by ID3
Brutzkus, Alon, Daniely, Amit, Malach, Eran
Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we analyze the heuristic of growing a decision tree with ID3 for a limited number of iterations $t$, assuming that nodes are split as in the case of exact information gain and probability computations. In several settings, we provide theoretical and empirical evidence that the TopDown variant of ID3, introduced by Kearns and Mansour (1996), produces trees with optimal or near-optimal test error among all trees with $t$ internal nodes. We prove optimality in the case of learning conjunctions under product distributions and learning read-once DNFs with 2 terms under the uniform distribution. Using efficient dynamic programming algorithms, we empirically show that TopDown generates trees that are near-optimal ($\sim 1\%$ difference from optimal test error) in a large number of settings for learning read-once DNFs under product distributions.
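For reference, here is a minimal sketch of the TopDown growing rule with empirical information gain (the analysis assumes exact gain and probability computations; this particular implementation and its names are ours):

```python
import numpy as np

def entropy(y):
    # Binary entropy of the empirical label distribution at a node.
    if y.size == 0:
        return 0.0
    q = y.mean()
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def topdown(X, y, t):
    # Grow a tree with t internal nodes: at every step, split the
    # (leaf, feature) pair with the largest weighted information gain.
    leaves = [np.arange(y.size)]
    for _ in range(t):
        best = None  # (gain, leaf position, feature, split mask)
        for li, idx in enumerate(leaves):
            if idx.size == 0:
                continue
            h = entropy(y[idx])
            for j in range(X.shape[1]):
                mask = X[idx, j] == 1
                frac = mask.mean()
                gain = (idx.size / y.size) * (
                    h - frac * entropy(y[idx[mask]])
                      - (1 - frac) * entropy(y[idx[~mask]]))
                if best is None or gain > best[0]:
                    best = (gain, li, j, mask)
        _, li, j, mask = best
        idx = leaves.pop(li)
        leaves += [idx[mask], idx[~mask]]
    # Each remaining leaf predicts its majority label.
    return [(idx, int(round(y[idx].mean()))) for idx in leaves if idx.size]
```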
ID3 Learns Juntas for Smoothed Product Distributions
Brutzkus, Alon, Daniely, Amit, Malach, Eran
In recent years, there have been many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm when the target function is a $k$-Junta, a function that depends on $k$ out of the $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns $k$-Juntas in polynomial time, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.
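As an illustration of the setting (a hypothetical sketch reflecting our reading of the model; the exact perturbation scheme in the paper may differ), one can generate a smoothed product distribution and a $k$-Junta target as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m, delta = 20, 4, 50_000, 0.1

# Smoothed product distribution (in the spirit of Kalai & Teng): each
# coordinate's bias is perturbed by a small random amount around a base value.
p = 0.5 + rng.uniform(-delta, delta, size=n)
X = (rng.random((m, n)) < p).astype(int)

# A k-Junta target: the label depends on only k of the n input bits
# (here, for concreteness, the parity of those bits).
J = rng.choice(n, size=k, replace=False)
y = X[:, J].sum(axis=1) % 2
```

The result then says that, on samples drawn this way with $k = \log n$, ID3 recovers the Junta in polynomial time.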
Decoupling Gating from Linearity
Fiat, Jonathan, Malach, Eran, Shalev-Shwartz, Shai
ReLU neural-networks have been the focus of many recent theoretical works, trying to explain their empirical success. Nonetheless, there is still a gap between current theoretical results and empirical observations, even in the case of shallow (one hidden-layer) networks. For example, in the task of memorizing a random sample of size $m$ and dimension $d$, the best theoretical result requires the size of the network to be $\tilde{\Omega}(\frac{m^2}{d})$, while empirically a network of size slightly larger than $\frac{m}{d}$ is sufficient. To bridge this gap, we study a simplified model for ReLU networks. We observe that a ReLU neuron is a product of a linear function with a gate (the latter determines whether the neuron is active or not), where both share a jointly trained weight vector. In this spirit, we introduce the Gated Linear Unit (GaLU), which simply decouples the linearity from the gating by assigning different vectors for each role. We show that GaLU networks allow us to get optimization and generalization results that are much stronger than those available for ReLU networks. Specifically, we show a memorization result for networks of size $\tilde{\Omega}(\frac{m}{d})$, and improved generalization bounds. Finally, we show that in some scenarios, GaLU networks behave similarly to ReLU networks, hence proving to be a good choice of a simplified model.
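The decoupling is easy to state in code. Below is a minimal sketch (class and variable names are ours; freezing the gate vectors at their random initialization is one natural instantiation, chosen here for simplicity):

```python
import torch

class GaLU(torch.nn.Module):
    # A ReLU neuron computes max(w.x, 0) = (w.x) * 1[w.x > 0], so a single
    # vector w serves both the linear part and the gate. GaLU assigns a
    # separate vector u to the gate; here the gates stay at initialization.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = torch.nn.Linear(d_in, d_out, bias=False)    # trained linearity
        self.register_buffer("u", torch.randn(d_out, d_in))  # untrained gates

    def forward(self, x):
        gate = (x @ self.u.t() > 0).float()  # which neurons are "active"
        return self.w(x) * gate              # output is linear in w

# With fixed gates, each layer's output is linear in its trainable weights,
# which is what makes the optimization analysis tractable.
net = torch.nn.Sequential(GaLU(32, 128), torch.nn.Linear(128, 1))
```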
Is Deeper Better only when Shallow is Good?
Malach, Eran, Shalev-Shwartz, Shai
Understanding the power of depth in feed-forward neural networks is an ongoing challenge in the field of deep learning theory. While existing works establish the importance of depth for the expressive power of neural-networks, it remains an open question whether these benefits are exploited during a gradient-based optimization process. In this work we explore the relation between expressivity properties of deep networks and the ability to train them efficiently using gradient-based algorithms. We give a depth separation argument for distributions with fractal structure, showing that they can be expressed efficiently by deep networks, but not by shallow ones. These distributions have a natural coarse-to-fine structure, and we show that the balance between the coarse and fine details has a crucial effect on whether the optimization process is likely to succeed. We prove that when the distribution is concentrated on the fine details, gradient-based algorithms are likely to fail. Using this result we prove that, at least for some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general.
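For a concrete instance of such coarse-to-fine structure (our illustration: the classic middle-thirds Cantor set; the paper's fractal distributions are more general and come with a parameter balancing the coarse and fine details), one can label points by membership in a depth-$\ell$ approximation of the set:

```python
import numpy as np

def in_cantor(x, depth):
    # Membership in the depth-level approximation of the middle-thirds
    # Cantor set: each level removes the open middle third of every interval.
    for _ in range(depth):
        x *= 3
        if 1.0 < x < 2.0:          # x fell into a removed middle third
            return False
        x = x if x <= 1.0 else x - 2.0
    return True

rng = np.random.default_rng(2)
print([(round(x, 3), in_cantor(x, depth=4)) for x in rng.random(5)])
```

Each additional level contributes a finer layer of detail, mirroring the coarse-to-fine balance discussed above.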
A Provably Correct Algorithm for Deep Learning that Actually Works
Malach, Eran, Shalev-Shwartz, Shai
We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two-layer network followed by a simple clustering algorithm. Our algorithm stems from a deep generative model that generates images level by level, where lower-resolution images correspond to latent semantic classes. We analyze the convergence rate of our algorithm assuming that the data is indeed generated according to this model (as well as additional assumptions). While we do not pretend to claim that the assumptions are realistic for natural images, we do believe that they capture some true properties of real data. Furthermore, we show that our algorithm actually works in practice (on the CIFAR dataset), achieving results in the same ballpark as those of vanilla convolutional neural networks trained by stochastic gradient descent. Finally, our proof techniques may be of independent interest.
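To convey the shape of the procedure, here is a heavily simplified schematic of the layer-by-layer loop (dense instead of convolutional layers, plain k-means, synthetic data; all names and sizes are placeholders of ours, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans(Z, k, iters=20):
    # Plain k-means; returns hard cluster assignments.
    C = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        a = ((Z[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (a == j).any():
                C[j] = Z[a == j].mean(0)
    return a

def train_two_layer(X, y, h=16, steps=200, lr=0.1):
    # A few gradient steps on a two-layer ReLU network with squared loss.
    W1 = rng.normal(0, 0.1, (X.shape[1], h))
    W2 = rng.normal(0, 0.1, h)
    for _ in range(steps):
        A = np.maximum(X @ W1, 0)
        err = A @ W2 - y
        W2 -= lr * A.T @ err / len(y)
        W1 -= lr * X.T @ ((err[:, None] * W2) * (A > 0)) / len(y)
    return np.maximum(X @ W1, 0)           # the learned hidden representation

# Layer-by-layer loop: at each level, run gradient updates for a two-layer
# network, then cluster its hidden activations into discrete "latent semantic
# classes" that serve as the next level's representation.
X = rng.normal(size=(512, 8))
y = rng.integers(0, 2, 512).astype(float)
for level in range(3):
    H = train_two_layer(X, y)
    classes = kmeans(H, k=4)
    X = np.eye(4)[classes]
```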
Decoupling "when to update" from "how to update"
Malach, Eran, Shalev-Shwartz, Shai
Deep learning requires data. A useful approach to obtain data is to be creative and mine data from various sources that were created for different purposes. Unfortunately, this approach often leads to noisy labels. In this paper, we propose a meta algorithm for tackling the noisy labels problem. The key idea is to decouple "when to update" from "how to update". We demonstrate the effectiveness of our algorithm by mining data for gender classification: combining the Labeled Faces in the Wild (LFW) face recognition dataset with a textual genderizing service, which leads to a noisy dataset. While our approach is very simple to implement, it leads to state-of-the-art results. We analyze some convergence properties of the proposed algorithm.
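As we read the paper, the meta-algorithm maintains two predictors and performs the base update only on examples where they currently disagree. A minimal sketch of one such step (names and the choice of base learner are ours):

```python
import torch

def decoupled_step(net1, net2, opt1, opt2, x, y, loss_fn):
    # "When to update" is decided by disagreement between the two predictors;
    # "how to update" stays the base learner's usual gradient step.
    with torch.no_grad():
        disagree = net1(x).argmax(dim=1) != net2(x).argmax(dim=1)
    if disagree.any():
        for net, opt in ((net1, opt1), (net2, opt2)):
            opt.zero_grad()
            loss_fn(net(x[disagree]), y[disagree]).backward()
            opt.step()

# Usage with two hypothetical linear classifiers on random data:
net1, net2 = torch.nn.Linear(10, 2), torch.nn.Linear(10, 2)
opt1 = torch.optim.SGD(net1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(net2.parameters(), lr=0.1)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
decoupled_step(net1, net2, opt1, opt2, x, y, torch.nn.functional.cross_entropy)
```

On examples where both predictors already agree, no update occurs, the intuition being that updating on confidently agreed examples with noisy labels would do more harm than good.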