Li, Zhiyuan
Understanding Generalization of Deep Neural Networks Trained with Noisy Labels
Hu, Wei, Li, Zhiyuan, Yu, Dingli
Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. Such over-fitting ability hinders generalization when mislabeled training examples are present. On the other hand, simple regularization methods like early stopping seem to help generalization a lot in these scenarios. This paper makes progress towards theoretically explaining generalization of over-parameterized deep neural networks trained with noisy labels. Two simple regularization methods are analyzed: (i) regularization by the distance from the network parameters to their initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, we prove that gradient descent training with either of these two methods leads to a generalization guarantee on the true data distribution despite being trained using noisy labels. The generalization bound is independent of the network size, and is comparable to the bound one can get when there is no label noise. Empirical results verify the effectiveness of these methods on noisily labeled datasets.
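As a rough illustration of the two regularizers described in this abstract, the sketch below (PyTorch; the model, loss, data, and the strength lam are placeholder assumptions, not the paper's exact setup, and both regularizers are shown together for brevity although the paper analyzes each on its own) adds (i) a penalty on the distance from the current parameters to their initialization and (ii) a trainable per-example auxiliary variable added to the network output.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a small network, n noisily labeled examples, squared loss.
n, d, lam = 1000, 20, 0.1                          # lam is an assumed regularization strength
model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 1))
init_params = [p.detach().clone() for p in model.parameters()]   # snapshot of initialization
b = nn.Parameter(torch.zeros(n, 1))                # (ii) auxiliary variable, one per training example

X, y = torch.randn(n, d), torch.randn(n, 1)        # placeholder noisy-labeled data
opt = torch.optim.SGD(list(model.parameters()) + [b], lr=0.01)

for step in range(1000):
    opt.zero_grad()
    pred = model(X) + b                            # network output plus auxiliary variable
    data_loss = ((pred - y) ** 2).mean()
    # (i) distance-to-initialization penalty ||theta - theta_0||^2
    dist = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), init_params))
    (data_loss + lam * dist).backward()
    opt.step()
```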
On Exact Computation with an Infinitely Wide Neural Net
Arora, Sanjeev, Du, Simon S., Hu, Wei, Li, Zhiyuan, Salakhutdinov, Ruslan, Wang, Ruosong
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. A subsequent paper [Lee et al., 2019] gave heuristic Monte Carlo methods to estimate the NTK and its extension, Convolutional Neural Tangent Kernel (CNTK) and used this to try to understand the limiting behavior on datasets like CIFAR-10. The current paper gives the first efficient exact algorithm (based upon dynamic programming) for computing CNTK as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 5% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). We give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK. Our experiments also demonstrate that earlier Monte Carlo approximation can degrade the performance significantly, thus highlighting the power of our exact kernel computation, which we have applied even to the full CIFAR-10 dataset and 20-layer nets.
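To make the NTK object concrete, here is a rough NumPy sketch of the standard fully-connected ReLU NTK recursion between two inputs, together with a comment on how kernel regression would then use it. This is an illustrative assumption-laden sketch of the fully-connected case, not the paper's CNTK dynamic program or GPU implementation.

```python
import numpy as np

def ntk(x1, x2, depth=3):
    """Fully-connected ReLU NTK between two inputs (standard recursion; illustrative sketch)."""
    # Sigma^(0) entries; with the usual factor-2 ReLU normalization the diagonal
    # Sigma^(h)(x, x) stays equal to ||x||^2 at every layer.
    sig11, sig22, sig12 = float(x1 @ x1), float(x2 @ x2), float(x1 @ x2)
    theta = sig12                                        # Theta^(0) = Sigma^(0)
    for _ in range(depth):
        c = np.clip(sig12 / np.sqrt(sig11 * sig22), -1.0, 1.0)
        ang = np.arccos(c)
        sig12 = np.sqrt(sig11 * sig22) * (np.sin(ang) + (np.pi - ang) * c) / np.pi
        sig_dot = (np.pi - ang) / np.pi                  # derivative kernel Sigma-dot^(h)
        theta = theta * sig_dot + sig12                  # Theta^(h) = Theta^(h-1)*Sigma-dot^(h) + Sigma^(h)
    return theta

# Kernel regression with the NTK Gram matrix, as in the wide-net equivalence:
# K[i, j] = ntk(X[i], X[j]); prediction at x_test uses k(x_test)^T K^{-1} y.
```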
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Arora, Sanjeev, Du, Simon S., Hu, Wei, Li, Zhiyuan, Wang, Ruosong
The well-known work of Zhang et al. (2017) highlighted intriguing experimental phenomena about deep net training, specifically optimization and generalization, and asked whether theory could explain them. They showed that sufficiently powerful nets (with vastly more parameters than the number of training samples) can attain zero training error, regardless of whether the data is properly labeled or randomly labeled. Obviously, training with randomly labeled data cannot generalize, whereas training with properly labeled data generalizes. See Figure 2 replicating some of these results. Recent papers have begun to provide explanations, showing that gradient descent can allow an overparametrized multi-layer net to attain arbitrarily low training error on fairly generic datasets (Du et al., 2018a,c; Li & Liang, 2018; Allen-Zhu et al., 2018b; Zou et al., 2018), provided the amount of overparametrization is a high polynomial of the relevant parameters (i.e. ...
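The random-label phenomenon recounted above is easy to reproduce in miniature. The sketch below (PyTorch, with placeholder sizes and optimizer choices, not the paper's setup) trains a wide two-layer ReLU net on randomly labeled data and drives the training loss toward zero even though the labels carry no signal.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, width = 100, 20, 10000                     # far more parameters than training samples
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float() * 2 - 1    # random +/-1 labels: nothing real to learn

net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = ((net(X).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    opt.step()

print(loss.item())                               # training error driven toward zero anyway
```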
Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
Arora, Sanjeev, Li, Zhiyuan, Lyu, Kaifeng
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, $0.3$), gradient descent still approaches a stationary point (i.e., a solution where the gradient is zero) at a rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.
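A quick way to see the scale invariance this analysis relies on: with BN applied after a linear layer, rescaling that layer's weights by any positive constant leaves the normalized output, and hence the loss, unchanged. The check below is an illustrative sketch, not code from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 10)
layer = nn.Linear(10, 32, bias=False)
bn = nn.BatchNorm1d(32, affine=False)    # pure normalization with batch statistics

out1 = bn(layer(x))
with torch.no_grad():
    layer.weight.mul_(10.0)              # rescale the pre-BN weights by a constant c > 0
out2 = bn(layer(x))

print(torch.allclose(out1, out2, atol=1e-4))   # True: the loss cannot see the scale of W
```

Because the loss is invariant to the scale of such weights, their gradient shrinks as their norm grows, which is the mechanism behind a fixed learning rate behaving like an automatically tuned one.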
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks
Neyshabur, Behnam, Li, Zhiyuan, Bhojanapalli, Srinadh, LeCun, Yann, Srebro, Nathan
Despite existing work on ensuring generalization of neural networks in terms of scale-sensitive complexity measures, such as norms, margin, and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities, resulting in a tighter generalization bound for two-layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.
Online Improper Learning with an Approximation Oracle
Hazan, Elad, Hu, Wei, Li, Yuanzhi, Li, Zhiyuan
We revisit the question of reducing online learning to approximate optimization of the offline problem. In this setting, we give two algorithms with near-optimal performance in the full-information setting: they guarantee optimal regret and require only poly-logarithmically many calls to the approximation oracle per iteration. Furthermore, these algorithms apply to the more general class of improper learning problems. In the bandit setting, our algorithm also significantly improves the best previously known oracle complexity while maintaining the same regret.
Learning in Games: Robustness of Fast Convergence
Foster, Dylan J., Li, Zhiyuan, Lykouris, Thodoris, Sridharan, Karthik, Tardos, Eva
We show that learning algorithms satisfying a low approximate regret property experience fast convergence to approximate optimality in a large class of repeated games. Our property, which simply requires that each learner has small regret compared to a $(1+\epsilon)$-multiplicative approximation to the best action in hindsight, is ubiquitous among learning algorithms; it is satisfied even by the vanilla Hedge forecaster. Our results improve upon recent work of Syrgkanis et al. [28] in a number of ways. We require only that players observe payoffs under other players' realized actions, as opposed to expected payoffs. We further show that convergence occurs with high probability, and show convergence under bandit feedback. Finally, we improve upon the speed of convergence by a factor of n, the number of players. Both the scope of settings and the class of algorithms for which our analysis provides fast convergence are considerably broader than in previous work. Our framework applies to dynamic population games via a low approximate regret property for shifting experts. Here we strengthen the results of Lykouris et al. [19] in two ways: We allow players to select learning algorithms from a larger class, which includes a minor variant of the basic Hedge algorithm, and we increase the maximum churn in players for which approximate optimality is achieved.
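Since the abstract singles out the vanilla Hedge forecaster, here is a minimal sketch of Hedge itself (standard exponential weights over a fixed action set; the learning rate and loss feedback are illustrative assumptions, not the paper's game setup).

```python
import numpy as np

def hedge(loss_fn, n_actions, T, eta=0.1):
    """Vanilla Hedge: maintain a distribution over actions, update by exponential weights."""
    weights = np.ones(n_actions)
    total_loss = 0.0
    for t in range(T):
        p = weights / weights.sum()              # play the normalized weight distribution
        losses = loss_fn(t)                      # observed per-action losses in [0, 1]
        total_loss += p @ losses
        weights *= np.exp(-eta * losses)         # exponential-weights update
    return total_loss

# Example: 5 actions with random losses; regret against the best fixed
# action grows like O(sqrt(T log n)).
rng = np.random.default_rng(0)
print(hedge(lambda t: rng.random(5), n_actions=5, T=1000))
```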
Solving Marginal MAP Problems with NP Oracles and Parity Constraints
Xue, Yexiang, Li, Zhiyuan, Ermon, Stefano, Gomes, Carla P., Selman, Bart
Arising from many applications at the intersection of decision-making and machine learning, Marginal Maximum A Posteriori (Marginal MAP) problems unify the two main classes of inference, namely maximization (optimization) and marginal inference (counting), and are believed to have higher complexity than both of them. We propose XOR_MMAP, a novel approach to solve the Marginal MAP problem, which represents the intractable counting subproblem with queries to NP oracles, subject to additional parity constraints. XOR_MMAP provides a constant-factor approximation to the Marginal MAP problem by encoding it as a single optimization problem whose size is polynomial in that of the original problem. We evaluate our approach in several machine learning and decision-making applications, and show that our approach outperforms several state-of-the-art Marginal MAP solvers.
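The parity-constraint mechanism the abstract refers to can be illustrated with the standard counting-by-hashing idea: a solution set of size roughly 2^m tends to have survivors under m random XOR constraints, while larger m kills it off. The sketch below is a toy, brute-force illustration of that principle with made-up sizes, not the XOR_MMAP algorithm, which issues the survival checks as NP-oracle queries inside an outer maximization.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_parity_constraints(n_vars, m):
    """m random XOR constraints A x = b (mod 2) over n binary variables."""
    A = rng.integers(0, 2, size=(m, n_vars))
    b = rng.integers(0, 2, size=m)
    return A, b

def survives(x, A, b):
    """Does assignment x satisfy all parity constraints?"""
    return np.all((A @ x) % 2 == b)

# Toy solution set S with |S| = 64 = 2^6; about |S| / 2^m assignments survive
# m random parity constraints, so the largest m with survivors estimates log2 |S|.
n_vars = 10
true_set = [rng.integers(0, 2, size=n_vars) for _ in range(64)]
for m in range(1, 10):
    A, b = random_parity_constraints(n_vars, m)
    survivors = sum(survives(x, A, b) for x in true_set)
    print(m, survivors)
```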