Block, Adam
The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning
Block, Adam, Polyanskiy, Yury
Suppose we are given access to $n$ independent samples from a distribution $\mu$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $\nu$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tilde\Theta(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study the efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling.
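To fix ideas, here is a minimal Python sketch of selecting one of $n$ draws from $\mu$ by rejection; the likelihood-ratio function `ratio`, the truncation level `cap`, and the fallback to the last sample are illustrative assumptions, not the scheme analyzed in the paper.

```python
import numpy as np

def approx_rejection_sample(samples, ratio, cap, rng=None):
    """Return one of `samples` (i.i.d. draws from mu) so that its law approximates nu.

    `ratio(x)` is assumed to evaluate the Radon-Nikodym derivative d(nu)/d(mu) at x;
    `cap` is a hypothetical truncation level standing in for the uniform bound that
    the paper dispenses with.  Purely illustrative, not the paper's procedure.
    """
    rng = np.random.default_rng() if rng is None else rng
    for x in samples:
        # Accept x with probability proportional to the (truncated) likelihood ratio.
        if rng.uniform() <= min(ratio(x), cap) / cap:
            return x
    # If every candidate is rejected, fall back to the last sample; this fallback
    # is one source of the residual total variation error.
    return samples[-1]

# Toy example: mu = N(0, 1), nu = N(0.5, 1), so d(nu)/d(mu)(x) = exp(0.5 * x - 0.125).
rng = np.random.default_rng(0)
draws = rng.standard_normal(100)
out = approx_rejection_sample(draws, lambda x: np.exp(0.5 * x - 0.125), cap=5.0, rng=rng)
```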
Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making
Block, Adam, Rakhlin, Alexander, Simchowitz, Max
The online learning setting has become the most popular regime for studying sequential decision making with dependent and potentially adversarial data. While this paradigm is attractive due to its great generality and minimal set of assumptions [Cesa-Bianchi and Lugosi, 2006], the worst-case nature of the adversary creates statistical and computational challenges [Rakhlin et al., 2015, Littlestone, 1988, Hazan and Koren, 2016]. In order to mitigate these difficulties, Rakhlin et al. [2011] proposed the smoothed setting, wherein the adversary is constrained to sample data from a distribution whose likelihood ratio with respect to a fixed dominating measure is bounded above by 1/σ, which ensures that the adversary cannot choose worst-case inputs with high probability. As in other online learning settings, performance is measured via regret with respect to a best-in-hindsight comparator [Cesa-Bianchi and Lugosi, 2006]. Recent works have demonstrated strong computational-statistical tradeoffs in smoothed online learning: while there are statistically efficient algorithms that enjoy regret logarithmic in 1/σ, oracle-efficient algorithms, where the learner is assumed to have access to an Empirical Risk Minimization (ERM) oracle that can efficiently optimize functionals on the parameter space, necessarily suffer regret scaling polynomially in 1/σ [Haghtalab et al., 2022a,b, Block et al., 2022]. This gap is significant because, in many applications of interest, the natural scaling of σ is exponential in the ambient problem dimension [Block and Simchowitz, 2022]. A natural question remains: under which types of smoothing is it possible to design oracle-efficient algorithms with regret that scales polynomially in the problem dimension? A partial answer was provided by Block and Simchowitz [2022], who demonstrate an efficient algorithm based on the John Ellipsoid that attains log(T/σ) · poly(dimension) regret for noiseless linear classification, and for a suitable generalization to classification with polynomial features.
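For intuition about the oracle model referenced above, the following Python sketch shows a follow-the-leader style learner that interacts with the hypothesis class only through an ERM oracle (one call per round); the oracle signature, the squared loss, and the toy mean oracle are assumptions made for illustration, not the algorithms of the cited works.

```python
from typing import Any, Callable, List, Tuple

# An ERM oracle maps a dataset of (x, y) pairs to a hypothesis minimizing empirical risk.
ERMOracle = Callable[[List[Tuple[Any, float]]], Callable[[Any], float]]

def follow_the_leader(erm: ERMOracle, stream, default_prediction: float = 0.0):
    """Oracle-efficient online loop: predict, suffer squared loss, make one ERM call."""
    history, losses = [], []
    hypothesis = lambda x: default_prediction
    for x_t, y_t in stream:
        pred = hypothesis(x_t)            # predict with the current ERM solution
        losses.append((pred - y_t) ** 2)  # suffer loss on the revealed label
        history.append((x_t, y_t))
        hypothesis = erm(history)         # single oracle call per round
    return losses

def mean_oracle(data):
    """Toy ERM oracle over constant predictors: returns the empirical-mean predictor."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

losses = follow_the_leader(mean_oracle, [(t, float(t % 2)) for t in range(10)])
```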
Smoothed Online Learning for Prediction in Piecewise Affine Systems
Block, Adam, Simchowitz, Max, Tedrake, Russ
The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.
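For concreteness, a piecewise affine system applies a different affine update in each cell of a partition of the state space; the two-mode example below (with the partition given by the sign of the first coordinate) is a generic illustration and not a system from the paper.

```python
import numpy as np

def pwa_step(x, modes):
    """One step of a piecewise affine system, x_{t+1} = A_i x_t + b_i, where the mode i
    is selected by the region containing x (here: the sign of the first coordinate)."""
    A, b = modes[0] if x[0] >= 0 else modes[1]
    return A @ x + b

# Two illustrative modes; the discontinuity across x[0] = 0 is what breaks standard
# worst-case online learning guarantees.
modes = [(np.array([[0.9, 0.1], [0.0, 0.8]]), np.array([0.1, 0.0])),
         (np.array([[0.5, -0.2], [0.3, 0.7]]), np.array([-0.1, 0.2]))]
trajectory = [np.array([1.0, -0.5])]
for _ in range(10):
    trajectory.append(pwa_step(trajectory[-1], modes))
```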
Smoothed Online Learning is as Easy as Statistical Learning
Block, Adam, Dagan, Yuval, Golowich, Noah, Rakhlin, Alexander
Much of modern learning theory has been split between two regimes: the classical \emph{offline} setting, where data arrive independently, and the \emph{online} setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smoothed online setting, where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smoothed online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.
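To illustrate the kind of reduction described in the final sentence, here is a standard epsilon-greedy contextual bandit loop driven by an offline fitting oracle; the oracle interface, exploration rate, and toy per-action mean oracle are assumptions for illustration and are not the algorithm developed in the paper.

```python
import numpy as np

def epsilon_greedy_bandit(contexts, reward_fn, fit_oracle, n_actions, eps=0.1, seed=0):
    """Oracle-based contextual bandit sketch: explore with probability eps, otherwise
    play the action whose predicted reward (under the fitted model) is largest."""
    rng = np.random.default_rng(seed)
    data, total_reward = [], 0.0
    predict = lambda x, a: 0.0
    for x in contexts:
        if rng.uniform() < eps:
            action = int(rng.integers(n_actions))                        # explore
        else:
            action = max(range(n_actions), key=lambda a: predict(x, a))  # exploit
        r = reward_fn(x, action)
        total_reward += r
        data.append((x, action, r))
        predict = fit_oracle(data)        # one offline oracle call per round
    return total_reward

def per_action_mean_oracle(data):
    """Toy oracle: predict each action's average observed reward, ignoring the context."""
    sums, counts = {}, {}
    for _, a, r in data:
        sums[a] = sums.get(a, 0.0) + r
        counts[a] = counts.get(a, 0) + 1
    return lambda x, a: sums.get(a, 0.0) / counts.get(a, 1)
```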
Intrinsic Dimension Estimation
Block, Adam, Jia, Zeyu, Polyanskiy, Yury, Rakhlin, Alexander
It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.
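As a point of reference for the estimation task, the snippet below implements the well-known TwoNN estimator (Facco et al., 2017), which fits the ratio of second- to first-nearest-neighbor distances; it is included only to illustrate the problem and is not the estimator introduced in the paper.

```python
import numpy as np

def two_nn_dimension(X):
    """TwoNN-style intrinsic dimension estimate from nearest-neighbor distance ratios."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)           # exclude each point from its own neighbor list
    D.sort(axis=1)
    mu = D[:, 1] / D[:, 0]                # ratio of 2nd to 1st nearest-neighbor distance
    return len(X) / np.sum(np.log(mu))    # MLE of the Pareto exponent governing mu

# Points on a one-dimensional curve embedded in R^3 should yield an estimate near 1.
t = np.random.default_rng(0).uniform(0.0, 1.0, size=500)
X = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t), t], axis=1)
print(two_nn_dimension(X))
```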
Majorizing Measures, Sequential Complexities, and Online Learning
Block, Adam, Dagan, Yuval, Rakhlin, Sasha
We introduce the technique of generic chaining and majorizing measures for controlling sequential Rademacher complexity. We relate majorizing measures to the notion of fractional covering numbers, which we show to be dominated in terms of sequential scale-sensitive dimensions in a horizon-independent way, and, under additional complexity assumptions, establish tight control on worst-case sequential Rademacher complexity in terms of the integral of sequential scale-sensitive dimension. Finally, we establish a tight contraction inequality for worst-case sequential Rademacher complexity. The above constitutes the resolution of a number of outstanding open problems in extending the classical theory of empirical processes to the sequential case and, in turn, establishes sharp results for online learning.
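For orientation, the classical i.i.d. counterpart of such chaining bounds is Dudley's entropy integral, which, for a class $\mathcal{F}$ of $[-1,1]$-valued functions, controls empirical Rademacher complexity by an integral of metric entropy (here $\mathcal{N}(\mathcal{F},\delta,\|\cdot\|_{L_2(P_n)})$ is a covering number and the $\epsilon_i$ are Rademacher signs); the paper's contribution concerns sequential analogues of bounds of this shape, which are not reproduced here.
$$
\mathbb{E}_{\epsilon}\,\sup_{f\in\mathcal{F}}\,\frac{1}{n}\sum_{i=1}^{n}\epsilon_i f(x_i)
\;\lesssim\; \inf_{\alpha\ge 0}\left(\alpha + \frac{1}{\sqrt{n}}\int_{\alpha}^{1}\sqrt{\log \mathcal{N}\bigl(\mathcal{F},\delta,\|\cdot\|_{L_2(P_n)}\bigr)}\,d\delta\right).
$$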
Fast Mixing of Multi-Scale Langevin Dynamics under the Manifold Hypothesis
Block, Adam, Mroueh, Youssef, Rakhlin, Alexander, Ross, Jerret
Recently, the task of image generation has attracted much attention. In particular, the recent empirical successes of the Markov Chain Monte Carlo (MCMC) technique of Langevin Dynamics have prompted a number of theoretical advances; despite this, several outstanding problems remain. First, Langevin Dynamics is run in very high dimension on a nonconvex landscape; in the worst case, due to the NP-hardness of nonconvex optimization, it is thought that Langevin Dynamics mixes only in time exponential in the dimension. In this work, we demonstrate how the manifold hypothesis allows for the considerable reduction of mixing time, from exponential in the ambient dimension to depending only on the (much smaller) intrinsic dimension of the data. Second, the high dimension of the sampling space significantly hurts the performance of Langevin Dynamics; we leverage a multi-scale approach to help ameliorate this issue and observe that this multi-resolution algorithm allows for a trade-off between image quality and computational expense in generation.
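For reference, the basic (unadjusted) Langevin update underlying these samplers is sketched below; the score function, step size, and horizon are placeholders, and the paper's multi-scale/multi-resolution schedule is deliberately not reproduced.

```python
import numpy as np

def langevin_dynamics(score, x0, step_size=1e-3, n_steps=1000, seed=0):
    """Unadjusted Langevin algorithm: x_{k+1} = x_k + eta * score(x_k) + sqrt(2 * eta) * noise,
    where `score(x)` is assumed to return grad log p(x)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step_size * score(x) + np.sqrt(2 * step_size) * rng.standard_normal(x.shape)
    return x

# Toy example: the standard Gaussian has score function -x.
sample = langevin_dynamics(lambda x: -x, x0=np.zeros(2))
```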
Generative Modeling with Denoising Auto-Encoders and Langevin Sampling
Block, Adam, Mroueh, Youssef, Rakhlin, Alexander
We study convergence of a generative modeling method that first estimates the score function of the distribution using Denoising Auto-Encoders (DAE) or Denoising Score Matching (DSM) and then employs Langevin diffusion for sampling. We show that both DAE and DSM provide estimates of the score of the Gaussian-smoothed population density, allowing us to apply the machinery of Empirical Processes. We overcome the challenge of relying only on $L^2$ bounds on the score estimation error and provide finite-sample bounds in the Wasserstein distance between the population distribution and the law of this sampling scheme. We then apply our results to the homotopy method of arXiv:1907.05600 and provide theoretical justification for its empirical success.
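As a concrete reminder of the first stage, the snippet below computes the standard denoising score matching loss at a single noise level, regressing a score model onto the score of the Gaussian perturbation kernel; the model interface and the $\sigma^2$ weighting are illustrative conventions, not the paper's specific setup.

```python
import numpy as np

def dsm_loss(score_model, data, sigma, rng):
    """Denoising score matching loss at noise level sigma: perturb each point with
    Gaussian noise and regress score_model onto grad_x log N(x_tilde | x, sigma^2 I)
    = -(x_tilde - x) / sigma^2.  The score model is a placeholder callable."""
    noise = sigma * rng.standard_normal(data.shape)
    perturbed = data + noise
    target = -noise / sigma**2
    residual = score_model(perturbed) - target
    return 0.5 * sigma**2 * np.mean(np.sum(residual**2, axis=-1))
```

Minimizing such a loss over a model class yields an estimate of the score of the Gaussian-smoothed density, which can then be plugged into a Langevin sampler like the one sketched earlier in this list.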