Iterative Barycenter Flows
Inouye, David I., Zhou, Zeyu, Gong, Ziyu, Ravikumar, Pradeep
The task of mapping two or more distributions to a shared representation has many applications including fair representations, batch effect mitigation, and unsupervised domain adaptation. However, most existing formulations only consider the setting of two distributions, and moreover, do not have an identifiable, unique shared latent representation. We use optimal transport theory to consider a natural multiple-distribution extension of the Monge assignment problem, which we call the symmetric Monge map problem, and show that it is equivalent to the Wasserstein barycenter problem. Yet the maps to the barycenter are challenging to estimate: prior methods often ignore transportation cost, rely on adversarial methods, or only work for discrete distributions. Therefore, our goal is to estimate invertible maps between two or more distributions and their corresponding barycenter via a simple iterative flow method. Our method decouples each iteration into two subproblems: 1) estimate simple distributions and 2) estimate the invertible maps to the barycenter via known closed-form OT results. Our empirical results give evidence that this iterative algorithm approximates the maps to the barycenter.
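Since only the abstract appears here, the following is a minimal 1D sketch (illustrative names, equal sample sizes per group assumed, and not the paper's algorithm) of the closed-form ingredient the abstract mentions: in one dimension the barycenter's quantile function is the average of the groups' quantile functions, so the monotone, hence invertible, map from each group to the barycenter is available in closed form.

```python
import numpy as np

def barycenter_maps_1d(groups):
    """Exact empirical 1D W2 barycenter: its quantile function is the average of
    the groups' quantile functions, so the Monge map for group i sends each
    sample to the barycenter order statistic of the same rank (monotone,
    hence invertible)."""
    sorted_groups = [np.sort(g) for g in groups]
    q_bar = np.mean(sorted_groups, axis=0)        # barycenter order statistics
    mapped = []
    for g, s in zip(groups, sorted_groups):
        ranks = np.searchsorted(s, g)             # empirical rank of each sample
        mapped.append(q_bar[np.clip(ranks, 0, len(q_bar) - 1)])
    return q_bar, mapped

rng = np.random.default_rng(0)
groups = [rng.normal(-2.0, 0.5, 1000), rng.exponential(1.0, 1000)]
q_bar, mapped = barycenter_maps_1d(groups)
# After mapping, every group's sample follows the empirical barycenter.
print(np.allclose(np.sort(mapped[0]), q_bar), np.allclose(np.sort(mapped[1]), q_bar))
```

The paper's iterative flow method is needed precisely because no such simple closed form exists in higher dimensions.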
Contrastive learning of strong-mixing continuous-time stochastic processes
Liu, Bingbin, Ravikumar, Pradeep, Risteski, Andrej
One of the paradigms of learning from unlabeled data that has seen a lot of recent work in various application domains is "self-supervised learning". These methods supervise the training process with information inherent to the data, without requiring human annotations, and have been applied across computer vision, natural language processing, reinforcement learning, and scientific domains. Despite their popularity, they are still not well understood, on both the theoretical and the empirical front, often requiring extensive trial and error to find the right pairing of architecture and learning method. In particular, it is often hard to pin down what exactly these methods are trying to learn, and even harder to determine their statistical and algorithmic complexity. The specific family of self-supervised approaches we focus on in this work is contrastive learning, which constructs different types of tuples by utilizing certain structures in the data and trains the model to identify the types. For an example in vision, Chen et al. (2020) apply two random augmentations (e.g., random cropping and color distortion) to each image and train the model to recognize that the two augmented views come from the same underlying image.
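As a concrete illustration of the contrastive setup described above, here is a minimal numpy sketch of an InfoNCE-style loss (names and the temperature value are illustrative, not taken from the paper): two augmented views of the same batch form positive pairs, and each view must be matched against the rest of the batch.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) embeddings of two augmentations of the same batch."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature   # cosine similarities between views
    # Row i's positive is column i; every other column acts as a negative.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
noise = 0.1 * rng.normal(size=(8, 16))
print(info_nce_loss(z, z + noise))  # aligned views -> small loss
```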
An Online Learning Approach to Interpolation and Extrapolation in Domain Generalization
Rosenfeld, Elan, Ravikumar, Pradeep, Risteski, Andrej
Modern machine learning algorithms excel when the training and test distributions match but often fail under even moderate distribution shift (Beery et al., 2018); learning a predictor that generalizes to distributions which differ from the training data is therefore an important task. This objective, broadly referred to as out-of-distribution (OOD) generalization, is not realizable in general, so researchers have formalized several possible restrictions. Common choices include a structural assumption such as covariate or label shift (Widmer & Kubat, 1996; Bickel et al., 2009; Lipton et al., 2018) or expecting that the test distribution will lie in some uncertainty set around the training distribution (Bagnell, 2005; Rahimian & Mehrotra, 2019). One popular assumption is that the training data comprises a collection of "environments" (Blanchard et al., 2011; Muandet et al., 2013; Peters et al., 2016) or "groups" (Sagawa et al., 2020), each representing a distinct distribution, where the group identity of each sample is known. The hope is that by cleverly training on such a combination of groups, one can derive a robust predictor that better transfers to unseen test data related to the observed distributions; this task is known as domain generalization.
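To make the group setting concrete, here is a generic sketch of one way to train against a combination of groups: an online min-max scheme with exponential weights, in the spirit of group DRO. It illustrates the setting only, not the paper's algorithm; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: three groups ("environments"), a linear model, squared loss.
groups = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(3)]
w = np.zeros(5)
q = np.ones(3) / 3          # adversary's mixture weights over groups
lr, eta = 0.05, 0.5

for _ in range(200):
    losses = np.array([np.mean((X @ w - y) ** 2) for X, y in groups])
    q = q * np.exp(eta * losses)      # adversary upweights the worst group
    q /= q.sum()
    grad = sum(qi * 2 * X.T @ (X @ w - y) / len(y) for qi, (X, y) in zip(q, groups))
    w -= lr * grad                    # learner descends on the mixture loss
print(np.round(losses, 3), np.round(q, 3))
```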
On Proximal Policy Optimization's Heavy-tailed Gradients
Garg, Saurabh, Zhanson, Joshua, Parisotto, Emilio, Prasad, Adarsh, Kolter, J. Zico, Balakrishnan, Sivaraman, Lipton, Zachary C., Salakhutdinov, Ruslan, Ravikumar, Pradeep
Modern policy gradient algorithms, notably Proximal Policy Optimization (PPO), rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.
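For concreteness, the following is a minimal sketch of the kind of geometric median-of-means (GMOM) aggregation the abstract refers to, applied to a batch of per-sample gradients; the block count, iteration budget, and Weiszfeld-style solver here are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld's algorithm for the point minimizing the sum of distances."""
    mu = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - mu, axis=1)
        w = 1.0 / np.maximum(d, eps)
        mu_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < eps:
            break
        mu = mu_new
    return mu

def gmom(grads, n_blocks=10):
    """Split gradients into blocks, average each block, take the geometric
    median of the block means; heavy-tailed samples corrupt only a few blocks."""
    blocks = np.array_split(grads, n_blocks)
    means = np.stack([b.mean(axis=0) for b in blocks])
    return geometric_median(means)

rng = np.random.default_rng(0)
grads = rng.normal(size=(1000, 8))
grads[:50] += 100.0                 # heavy-tailed / outlier gradients
print(np.linalg.norm(grads.mean(axis=0)), np.linalg.norm(gmom(grads)))
```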
When Is Generalizable Reinforcement Learning Tractable?
Malik, Dhruv, Li, Yuanzhi, Ravikumar, Pradeep
Agents trained by reinforcement learning (RL) often fail to generalize beyond the environment they were trained in, even when presented with new scenarios that seem very similar to the training environment. We study the query complexity required to train RL agents that can generalize to multiple environments. Intuitively, tractable generalization is only possible when the environments are similar or close in some sense. To capture this, we introduce Strong Proximity, a structural condition which precisely characterizes the relative closeness of different environments. We provide an algorithm which exploits Strong Proximity to provably and efficiently generalize. We also show that under a natural weakening of this condition, which we call Weak Proximity, RL can require query complexity that is exponential in the horizon to generalize. A key consequence of our theory is that even when the environments share optimal trajectories, and have highly similar reward and transition functions (as measured by classical metrics), tractable generalization is impossible.
Fundamental Limits and Tradeoffs in Invariant Representation Learning
Zhao, Han, Dan, Chen, Aragam, Bryon, Jaakkola, Tommi S., Gordon, Geoffrey J., Ravikumar, Pradeep
Many machine learning applications involve learning representations that achieve two competing goals: to maximize information or accuracy with respect to a subset of features (e.g.\ for prediction) while simultaneously maximizing invariance or independence with respect to another, potentially overlapping, subset of features (e.g.\ for fairness, privacy, etc.). Typical examples include privacy-preserving learning, domain adaptation, and algorithmic fairness, just to name a few. In fact, all of the above problems admit a common minimax game-theoretic formulation, whose equilibrium represents a fundamental tradeoff between accuracy and invariance. Despite its abundant applications in the aforementioned domains, theoretical understanding of the limits and tradeoffs of invariant representations is severely lacking. In this paper, we provide an information-theoretic analysis of this general and important problem under both classification and regression settings. In both cases, we analyze the inherent tradeoffs between accuracy and invariance by giving a geometric characterization of the feasible region in the information plane, connecting the geometric properties of this region to the fundamental limitations of the tradeoff problem. In the regression setting, we also derive a tight lower bound on the Lagrangian objective that quantifies the tradeoff between accuracy and invariance; this bound yields a better understanding of the tradeoff via the spectral properties of the joint distribution. These results shed new light on the interplay between accuracy and invariance and may be useful in guiding the design of adversarial representation learning algorithms.
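As a point of reference for the minimax formulation the abstract invokes, one common instantiation of the accuracy-invariance game (with generic symbols; the paper's exact formulation may differ) is:

```latex
% Encoder g and predictor f minimize the target loss while an adversary h
% tries to predict the protected / nuisance attribute A from Z = g(X);
% \lambda trades accuracy off against invariance.
\min_{g,\,f}\ \max_{h}\quad
  \mathbb{E}\bigl[\ell\bigl(f(g(X)),\,Y\bigr)\bigr]
  \;-\; \lambda\,\mathbb{E}\bigl[\ell\bigl(h(g(X)),\,A\bigr)\bigr]
```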
The Risks of Invariant Risk Minimization
Rosenfeld, Elan, Ravikumar, Pradeep, Risteski, Andrej
Invariant Causal Prediction (Peters et al., 2016) is a technique for out-of-distribution generalization which assumes that some aspects of the data distribution vary across the training set but that the underlying causal mechanisms remain constant. Recently, Arjovsky et al. (2019) proposed Invariant Risk Minimization (IRM), an objective based on this idea for learning deep, invariant features of data which are a complex function of latent variables; many alternatives have subsequently been suggested. However, formal guarantees for all of these works are severely lacking. In this paper, we present the first analysis of classification under the IRM objective, as well as these recently proposed alternatives, under a fairly natural and general model. In the linear case, we show simple conditions under which the optimal solution succeeds or, more often, fails to recover the optimal invariant predictor. We furthermore present the very first results in the non-linear regime: we demonstrate that IRM can fail catastrophically unless the test data are sufficiently similar to the training distribution; this is precisely the issue that it was intended to solve. Thus, in this setting we find that IRM and its alternatives fundamentally do not improve over standard Empirical Risk Minimization.
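For reference, the objective under analysis is the bi-level IRM program of Arjovsky et al. (2019), together with its practical relaxation IRMv1 ($R^e$ denotes the risk in environment $e$; the notation follows the original paper):

```latex
% IRM: find a featurizer \Phi whose classifier w is simultaneously optimal
% in every training environment e.
\min_{\Phi,\,w}\ \sum_{e \in \mathcal{E}} R^e(w \circ \Phi)
\quad \text{s.t.}\quad
w \in \operatorname*{arg\,min}_{\bar{w}}\ R^e(\bar{w} \circ \Phi)
\quad \forall\, e \in \mathcal{E}.
% IRMv1: replace the constraint with a gradient penalty at the fixed
% scalar classifier w = 1.0.
\min_{\Phi}\ \sum_{e \in \mathcal{E}}
  R^e(\Phi) \;+\; \lambda\,\bigl\|\nabla_{w \,\mid\, w=1.0}\, R^e(w \cdot \Phi)\bigr\|^2.
```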
Sharp Statistical Guarantees for Adversarially Robust Gaussian Classification
Dan, Chen, Wei, Yuting, Ravikumar, Pradeep
Adversarial robustness has become a fundamental requirement in modern machine learning applications. Yet, there has been surprisingly little statistical understanding of it so far. In this paper, we provide the first optimal minimax guarantees on the excess risk for adversarially robust classification under the Gaussian mixture model proposed by Schmidt et al. (2018). The results are stated in terms of the Adversarial Signal-to-Noise Ratio (AdvSNR), which generalizes a similar notion for standard linear classification to the adversarial setting. For Gaussian mixtures with an AdvSNR value of $r$, we establish an excess risk lower bound of order $\Theta(e^{-(\frac{1}{8}+o(1)) r^2} \frac{d}{n})$ and design a computationally efficient estimator that achieves this optimal rate. Our results are built on a minimal set of assumptions while covering a wide spectrum of adversarial perturbations, including $\ell_p$ balls for any $p \ge 1$.
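To make the setting concrete, here is an illustrative simulation (not the paper's estimator; all constants are arbitrary) of adversarially robust classification in a Gaussian mixture: for a linear classifier, the worst-case $\ell_\infty$ perturbation of radius $\varepsilon$ shifts the margin by $\varepsilon\|w\|_1$, giving a closed-form robust error that a Monte Carlo attack confirms.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, sigma, eps, n = 10, 1.0, 0.1, 200_000
mu = np.ones(d) / np.sqrt(d)
w = mu                                   # classifier aligned with the means

y = rng.choice([-1.0, 1.0], size=n)      # labels of the two mixture components
x = y[:, None] * mu + sigma * rng.normal(size=(n, d))
margin = y * (x @ w)
# Optimal l_inf attack flips the sign whenever the margin is below eps*||w||_1.
mc_robust_err = np.mean(margin <= eps * np.abs(w).sum())
closed_form = norm.cdf((eps * np.abs(w).sum() - w @ mu) / (sigma * np.linalg.norm(w)))
print(mc_robust_err, closed_form)
```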
Sub-Seasonal Climate Forecasting via Machine Learning: Challenges, Analysis, and Advances
He, Sijie, Li, Xinyan, DelSole, Timothy, Ravikumar, Pradeep, Banerjee, Arindam
Sub-seasonal climate forecasting (SSF) focuses on predicting key climate variables such as temperature and precipitation on 2-week to 2-month time scales. Skillful SSF would have immense societal value in areas such as agricultural productivity, water resource management, transportation and aviation systems, and emergency planning for extreme weather events. However, SSF is considered more challenging than either weather prediction or even seasonal prediction. In this paper, we carefully study a variety of machine learning (ML) approaches for SSF over the US mainland. While atmosphere-land-ocean couplings and the limited amount of good-quality data make it hard to apply black-box ML naively, we show that with carefully constructed feature representations, even linear regression models, e.g., Lasso, can be made to perform well. Among a broad suite of 10 ML approaches considered, gradient boosting performs the best, and deep learning (DL) methods show some promise with careful architecture choices. Overall, suitable ML methods are able to outperform the climatological baseline, i.e., predictions based on the 30-year average at a given location and time. Further, based on studying feature importance, ocean covariates (especially indices based on climatic oscillations such as El Nino) and land covariates (soil moisture) are found to be predictive, whereas atmospheric covariates are not found to be helpful.
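The comparison against climatology can be sketched as follows on synthetic data (all names, sizes, and the regularization strength are illustrative; the paper's features are carefully constructed lagged climate covariates): a sparse Lasso fit versus the historical-average baseline.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, n_lags = 600, 8
# Stand-in covariates: lagged ocean/land indices; only a few truly matter.
X = rng.normal(size=(n, n_lags))
beta = np.zeros(n_lags)
beta[:2] = [0.8, -0.5]                                # sparse signal
climatology = 10.0                                    # long-term mean at the site
y = climatology + X @ beta + 0.5 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = X[:500], X[500:], y[:500], y[500:]
model = Lasso(alpha=0.05).fit(X_tr, y_tr)
mse_ml = np.mean((model.predict(X_te) - y_te) ** 2)
mse_clim = np.mean((y_tr.mean() - y_te) ** 2)         # climatological baseline
print(f"Lasso MSE {mse_ml:.3f} vs climatology MSE {mse_clim:.3f}")
print("selected lags:", np.nonzero(model.coef_)[0])
```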
Learning Minimax Estimators via Online Learning
Gupta, Kartik, Suggala, Arun Sai, Prasad, Adarsh, Netrapalli, Praneeth, Ravikumar, Pradeep
We consider the problem of designing minimax estimators for estimating the parameters of a probability distribution. Unlike classical approaches such as the MLE and minimum distance estimators, we consider an algorithmic approach for constructing such estimators. We view the problem of designing minimax estimators as finding a mixed strategy Nash equilibrium of a zero-sum game. By leveraging recent results in online learning with non-convex losses, we provide a general algorithm for finding a mixed-strategy Nash equilibrium of general non-convex non-concave zero-sum games. Our algorithm requires access to two subroutines: (a) one which outputs a Bayes estimator corresponding to a given prior probability distribution, and (b) one which computes the worst-case risk of any given estimator. Given access to these two subroutines, we show that our algorithm outputs both a minimax estimator and a least favorable prior. To demonstrate the power of this approach, we use it to construct provably minimax estimators for classical problems such as estimation in the finite Gaussian sequence model and linear regression.
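The recipe above can be exercised numerically on a classic toy problem, the bounded normal mean (a sketch under illustrative grid sizes and learning rate, not the paper's exact algorithm): subroutine (a) is the posterior mean under the current mixed prior, subroutine (b) is a grid search for the worst-case risk, and the prior is updated by exponential weights.

```python
import numpy as np

thetas = np.linspace(-1, 1, 41)       # parameter grid (pure strategies in [-1, 1])
xs = np.linspace(-6, 6, 601)          # integration grid for the observation x
dx = xs[1] - xs[0]
lik = np.exp(-0.5 * (xs[:, None] - thetas[None, :]) ** 2)  # unnormalized N(theta, 1)

prior = np.ones_like(thetas) / len(thetas)
eta, cum_risk = 2.0, np.zeros_like(thetas)
for _ in range(300):
    # (a) Bayes estimator for the current mixed prior: the posterior mean.
    post = lik * prior[None, :]
    delta = (post @ thetas) / post.sum(axis=1)
    # (b) Worst-case risk: R(theta) = E_x[(delta(x) - theta)^2] on the grid.
    risk = ((delta[:, None] - thetas[None, :]) ** 2 * lik).sum(axis=0) * dx / np.sqrt(2 * np.pi)
    # Adversary: exponential-weights update of the prior toward high-risk thetas.
    cum_risk += risk
    prior = np.exp(eta * (cum_risk - cum_risk.max()))
    prior /= prior.sum()
print("worst-case risk:", risk.max().round(4),
      "prior mass at endpoints:", prior[[0, -1]].round(3))
```

The prior iterates approximate a least favorable prior (here it concentrates near the endpoints of the interval), while the final Bayes estimator approximates the minimax estimator.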