Mei, Song
Local convexity of the TAP free energy and AMP convergence for Z2-synchronization
Celentano, Michael, Fan, Zhou, Mei, Song
We study mean-field variational Bayesian inference using the TAP approach, for Z2-synchronization as a prototypical example of a high-dimensional Bayesian model. We show that for any signal strength $\lambda > 1$ (the weak-recovery threshold), there exists a unique local minimizer of the TAP free energy functional near the mean of the Bayes posterior law. Furthermore, the TAP free energy in a local neighborhood of this minimizer is strongly convex. Consequently, a natural-gradient/mirror-descent algorithm achieves linear convergence to this minimizer from a local initialization, which may be obtained by a constant number of iterates of Approximate Message Passing (AMP). This provides a rigorous foundation for variational inference in high dimensions via minimization of the TAP free energy. We also analyze the finite-sample convergence of AMP, showing that AMP is asymptotically stable at the TAP minimizer for any $\lambda > 1$, and is linearly convergent to this minimizer from a spectral initialization for sufficiently large $\lambda$. Such a guarantee is stronger than results obtainable by state evolution analyses, which only describe a fixed number of AMP iterations in the infinite-sample limit. Our proofs combine the Kac-Rice formula and Sudakov-Fernique Gaussian comparison inequality to analyze the complexity of critical points that satisfy strong convexity and stability conditions within their local neighborhoods.
Lower Bounds for Learning in Revealing POMDPs
Chen, Fan, Wang, Huan, Xiong, Caiming, Mei, Song, Bai, Yu
This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition} -- A natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and achieve significantly smaller gaps against the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; (2) Any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting its fundamental difference from the \emph{single-step} case where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques in \emph{distribution testing}, which is new to the RL literature and may be of independent interest.
Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms
Partial Observability -- where agents can only observe partial information about the true underlying state of the system -- is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called \emph{B-stability}. B-stable PSRs encompasses the vast majority of known tractable subclasses such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.
Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games
Song, Ziang, Mei, Song, Bai, Yu
Imperfect-Information Extensive-Form Games (IIEFGs) is a prevalent model for real-world games involving imperfect information and sequential plays. The Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural solution concept for multi-player general-sum IIEFGs. However, existing algorithms for finding an EFCE require full feedback from the game, and it remains open how to efficiently learn the EFCE in the more challenging bandit feedback setting where the game can only be learned by observations from repeated playing. This paper presents the first sample-efficient algorithm for learning the EFCE from bandit feedback. We begin by proposing $K$-EFCE -- a more generalized definition that allows players to observe and deviate from the recommended actions for $K$ times. The $K$-EFCE includes the EFCE as a special case at $K=1$, and is an increasingly stricter notion of equilibrium as $K$ increases. We then design an uncoupled no-regret algorithm that finds an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K}/\varepsilon^2)$ iterations in the full feedback setting, where $X_i$ and $A_i$ are the number of information sets and actions for the $i$-th player. Our algorithm works by minimizing a wide-range regret at each information set that takes into account all possible recommendation histories. Finally, we design a sample-based variant of our algorithm that learns an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K+1}/\varepsilon^2)$ episodes of play in the bandit feedback setting. When specialized to $K=1$, this gives the first sample-efficient algorithm for learning EFCE from bandit feedback.
Near-Optimal Learning of Extensive-Form Games with Imperfect Information
Bai, Yu, Jin, Chi, Mei, Song, Yu, Tiancheng
This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts. We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games.
Learning with convolution and pooling operations in kernel methods
Misiakiewicz, Theodor, Mei, Song
Recent empirical work has shown that hierarchical convolutional kernels inspired by convolutional neural networks (CNNs) significantly improve the performance of kernel methods in image classification tasks. A widely accepted explanation for the success of these architectures is that they encode hypothesis classes that are suitable for natural images. However, understanding the precise interplay between approximation and generalization in convolutional architectures remains a challenge. In this paper, we consider the stylized setting of covariates (image pixels) uniformly distributed on the hypercube, and fully characterize the RKHS of kernels composed of single layers of convolution, pooling, and downsampling operations. We then study the gain in sample efficiency of kernel methods using these kernels over standard inner-product kernels. In particular, we show that 1) the convolution layer breaks the curse of dimensionality by restricting the RKHS to `local' functions; 2) local pooling biases learning towards low-frequency functions, which are stable by small translations; 3) downsampling may modify the high-frequency eigenspaces but leaves the low-frequency part approximately unchanged. Notably, our results quantify how choosing an architecture adapted to the target function leads to a large improvement in the sample complexity.
The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods
Ghosh, Nikhil, Mei, Song, Yu, Bin
To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there exists a limited theoretical understanding of when and why such phenomena occur. In this paper, we consider the training dynamics of gradient flow on kernel least-squares objectives, which is a limiting dynamics of SGD trained neural networks. Using precise high-dimensional asymptotics, we characterize the dynamics of the fitted model in two "worlds": in the Oracle World the model is trained on the population distribution and in the Empirical World the model is trained on a sampled dataset. We show that under mild conditions on the kernel and $L^2$ target regression function the training dynamics undergo three stages characterized by the behaviors of the models in the two worlds. Our theoretical results also mathematically formalize some interesting deep learning phenomena. Specifically, in our setting we show that SGD progressively learns more complex functions and that there is a "deep bootstrap" phenomenon: during the second stage, the test error of both worlds remain close despite the empirical training error being much smaller. Finally, we give a concrete example comparing the dynamics of two different kernels which shows that faster training is not necessary for better generalization.
When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?
Song, Ziang, Mei, Song, Bai, Yu
Multi-agent reinforcement learning (RL) has achieved substantial recent successes in solving artificial intelligence challenges such as GO (Silver et al., 2016, 2018), multi-player games with team play such as Starcraft (Vinyals et al., 2019) and Dota2 (Berner et al., 2019), behavior learning in social interactions (Baker et al., 2019), and economic simulation (Zheng et al., 2020; Trott et al., 2021). In many applications, multi-agent RL is able to yield high quality policies for multi-player games with a large number of players (Wang et al., 2016; Yang et al., 2018). Despite these empirical progresses, theoretical understanding of when we can sample-efficiently solve multi-player games with a large number of players remains elusive, especially in the setting of multi-player Markov games. A main bottleneck here is the exponential blow-up of the joint action space--The total number of joint actions in a generic game with simultaneous plays is equal to the product of the number of actions for each player, which scales exponentially in the number of players.
Understanding the Under-Coverage Bias in Uncertainty Estimation
Bai, Yu, Mei, Song, Wang, Huan, Xiong, Caiming
This paper is concerned with the problem of uncertainty estimation in regression problems. Uncertainty estimation is an increasingly important task in modern machine learning applications--Models should not only make high-accuracy predictions, but also have a sense of how much the true label may deviate from the prediction. This capability is crucial for deploying machine learning in the real world, in particular in risk-sensitive domains such as medical AI [15, 29], self-driving cars [47], and so on. A common approach for uncertainty estimation in regression is to learn a quantile function or a prediction interval of the true label conditioned on the input, which provides useful distributional information about the label. Such learned quantiles are typically evaluated by their coverage, i.e., probability that it covers the true label on a new test example. For example, a learned 90% upper quantile function should be an actual upper bound of the true label at least 90% of the time. Algorithms for learning quantiles date back to the classical quantile regression [35], which estimates the quantile function by solving an empirical risk minimization problem with a suitable loss function that depends on the desired quantile level α.
Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models
Yang, Zitong, Bai, Yu, Mei, Song
Recent work showed that there could be a large gap between the classical uniform convergence bound and the actual test error of zero-training-error predictors (interpolators) such as deep neural networks. To better understand this gap, we study the uniform convergence in the nonlinear random feature model and perform a precise theoretical analysis on how uniform convergence depends on the sample size and the number of parameters. We derive and prove analytical expressions for three quantities in this model: 1) classical uniform convergence over norm balls, 2) uniform convergence over interpolators in the norm ball (recently proposed by Zhou et al. (2020)), and 3) the risk of minimum norm interpolator. We show that, in the setting where the classical uniform convergence bound is vacuous (diverges to $\infty$), uniform convergence over the interpolators still gives a non-trivial bound of the test error of interpolating solutions. We also showcase a different setting where classical uniform convergence bound is non-vacuous, but uniform convergence over interpolators can give an improved sample complexity guarantee. Our result provides a first exact comparison between the test errors and uniform convergence bounds for interpolators beyond simple linear models.