Is Q-Learning Provably Efficient?
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, Michael I. Jordan
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [7, 22]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions.
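For readers less familiar with the model-free setting, the following is a minimal tabular Q-learning sketch of the kind of update the abstract refers to. The Gym-style reset()/step() environment interface and all hyperparameters are illustrative assumptions, not the paper's algorithm.

import numpy as np

# Minimal tabular Q-learning sketch. The agent updates Q(s, a) directly from
# observed transitions, never building an explicit model of the environment.
# `env` is assumed to expose a Gym-style reset()/step() interface.
def q_learning(env, num_states, num_actions,
               episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # model-free temporal-difference update
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q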
NEON2: Finding Local Minima via First-Order Oracles
Zeyuan Allen-Zhu, Yuanzhi Li
We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace Hessian-vector product computations with gradient computations only. It works in both the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha 2 into a first-order method without hurting its theoretical performance. It also converts SGD, GD, SCSG, and SVRG into algorithms that find approximate local minima, outperforming some of the best known results.
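The gradient-only replacement of Hessian-vector products mentioned above can be illustrated with the standard finite-difference identity Hv ≈ (∇f(x + qv) − ∇f(x)) / q, which lets a first-order oracle emulate a second-order one. The sketch below is a minimal numerical illustration of this identity, not the paper's NEON2 procedure; the quadratic test function is an assumption chosen so the approximation is exact.

import numpy as np

# A Hessian-vector product H(x) v can be approximated with two gradient
# evaluations, since (grad f(x + q v) - grad f(x)) / q -> H(x) v as q -> 0.
def hessian_vector_product(grad_f, x, v, q=1e-5):
    return (grad_f(x + q * v) - grad_f(x)) / q

# Test on f(x) = 0.5 * x^T A x, whose Hessian is exactly the symmetric matrix A.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
grad_f = lambda x: A @ x
x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
print(hessian_vector_product(grad_f, x, v))  # matches A @ v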
Natasha 2: Faster Non-Convex Optimization Than SGD
Zeyuan Allen-Zhu
Can SGD Learn Recurrent Neural Networks with Provable Generalization?
Zeyuan Allen-Zhu, Yuanzhi Li
Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis. Yet, in the foundational language of PAC learning, what concept classes can they learn? Moreover, how can the same recurrent unit simultaneously learn functions from different input tokens to different output tokens without the functions interfering with one another?
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be done simply by SGD (stochastic gradient descent) or its variants, in polynomial time and using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technical side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks used in prior works. We establish a new notion of quadratic approximation of the neural network and connect it to the SGD theory of escaping saddle points.
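As a toy illustration of the overparameterized regime described above, the sketch below trains a wide two-layer ReLU network with plain SGD on a smooth target. The width, target function, and hyperparameters are illustrative assumptions, not the paper's construction or its guarantees.

import numpy as np

# Overparameterized two-layer ReLU network (width m >> number of samples n)
# trained by plain SGD; only the first layer is trained, and the output layer
# is a fixed random sign vector, as is common in this line of analysis.
rng = np.random.default_rng(0)
n, d, m = 50, 5, 2000
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d))            # smooth target function

W = rng.standard_normal((m, d)) / np.sqrt(d)      # trainable first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed output layer

lr = 0.1
for step in range(5000):
    i = rng.integers(n)                           # one stochastic sample
    z = W @ X[i]
    err = a @ np.maximum(z, 0.0) - y[i]           # residual on sample i
    # SGD step on the first-layer weights (ReLU has gradient 1[z > 0])
    W -= lr * err * np.outer(a * (z > 0), X[i])

preds = np.maximum(X @ W.T, 0.0) @ a
print("train MSE:", np.mean((preds - y) ** 2))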
What Can ResNet Learn Efficiently, Going Beyond Kernels?
Zeyuan Allen-Zhu, Yuanzhi Li
How can neural networks such as ResNet efficiently learn CIFAR-10 with test accuracy above 96%, while other methods, especially kernel methods, fall relatively behind? Can we provide more theoretical justification for this gap? Recently, an influential line of work has related neural networks to kernels in the over-parameterized regime, proving that they can learn certain concept classes that are also learnable by kernels, with similar test error. Yet, can neural networks provably learn some concept class better than kernels? We answer this question positively in the distribution-free setting.