Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on-and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on the state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.
Early stopping for kernel boosting algorithms: A general analysis with localized complexities
Early stopping of iterative algorithms is a widely-used form of regularization in statistical learning, commonly used in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well-understood than their analogues based on penalized regularization. In this paper, for a relatively broad class of loss functions and boosting algorithms (including $L^2$-boost, LogitBoost and AdaBoost, among others), we connect the performance of a stopped iterate to the localized Rademacher/Gaussian complexity of the associated function class. This connection allows us to show that local fixed point analysis, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes, and illustrate the correspondence of our theory with practice for Sobolev kernel classes.
Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks
Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited precision inference in recent years, training of neural networks in low bit-width remains a challenging problem. Here we present the Flexpoint data format, aiming at a complete replacement of 32-bit floating point format training and inference, designed to support modern deep network topologies without modifications. Flexpoint tensors have a shared exponent that is dynamically adjusted to minimize overflows and maximize available dynamic range. We validate Flexpoint by training AlexNet, a deep residual network and a generative adversarial network, using a simulator implemented with the \emph{neon} deep learning framework. We demonstrate that 16-bit Flexpoint closely matches 32-bit floating point in training all three models, without any need for tuning of model hyperparameters. Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference.
Limitations on Variance-Reduction and Acceleration Schemes for Finite Sums Optimization
We study the conditions under which one is able to efficiently apply variance-reduction and acceleration schemes on finite sums problems. First, we show that perhaps surprisingly, the finite sum structure, by itself, is not sufficient for obtaining a complexity bound of $\tilde{\cO}((n+L/\mu)\ln(1/\epsilon))$ for $L$-smooth and $\mu$-strongly convex finite sums - one must also know exactly which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sums algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an `accelerated' complexity bound of $\tilde{\cO}((n+\sqrt{n L/\mu})\ln(1/\epsilon))$, unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing $L$-smooth and non-strongly convex finite sums, the optimal complexity bound is $\tilde{\cO}(n+L/\epsilon)$, assuming that (on average) the same update rule is used for any iteration, and $\tilde{\cO}(n+\sqrt{nL/\epsilon})$, otherwise.
Adaptive Active Hypothesis Testing under Limited Information
We consider the problem of active sequential hypothesis testing where a Bayesian decision maker must infer the true hypothesis from a set of hypotheses. The decision maker may choose for a set of actions, where the outcome of an action is corrupted by independent noise. In this paper we consider a special case where the decision maker has limited knowledge about the distribution of observations for each action, in that only a binary value is observed. Our objective is to infer the true hypothesis with low error, while minimizing the number of action sampled. Our main results include the derivation of a lower bound on sample size for our system under limited knowledge and the design of an active learning policy that matches this lower bound and outperforms similar known algorithms.
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Deep neural networks (NNs) are powerful black box predictors that have recently achieved impressive performance on a wide spectrum of tasks. Quantifying predictive uncertainty in NNs is a challenging and yet unsolved problem. Bayesian NNs, which learn a distribution over weights, are currently the state-of-the-art for estimating predictive uncertainty; however these require significant modifications to the training procedure and are computationally expensive compared to standard (non-Bayesian) NNs. We propose an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high quality predictive uncertainty estimates. Through a series of experiments on classification and regression benchmarks, we demonstrate that our method produces well-calibrated uncertainty estimates which are as good or better than approximate Bayesian NNs. To assess robustness to dataset shift, we evaluate the predictive uncertainty on test examples from known and unknown distributions, and show that our method is able to express higher uncertainty on out-of-distribution examples. We demonstrate the scalability of our method by evaluating predictive uncertainty estimates on ImageNet.
Imagination-Augmented Agents for Deep Reinforcement Learning
We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a trained environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several strong baselines.
Towards Generalization and Simplicity in Continuous Control
The remarkable successes of deep learning in speech recognition and computer vision have motivated efforts to adapt similar techniques to other problem domains, including reinforcement learning (RL). Consequently, RL methods have produced rich motor behaviors on simulated robot tasks, with their success largely attributed to the use of multi-layer neural networks. This work is among the first to carefully study what might be responsible for these recent advancements. Our main result calls this emerging narrative into question by showing that much simpler architectures -- based on linear and RBF parameterizations -- achieve comparable performance to state of the art results. We not only study different policy representations with regard to performance measures at hand, but also towards robustness to external perturbations. We again find that the learned neural network policies --- under the standard training scenarios --- are no more robust than linear (or RBF) policies; in fact, all three are remarkably brittle. Finally, we then directly modify the training scenarios in order to favor more robust policies, and we again do not find a compelling case to favor multi-layer architectures. Overall, this study suggests that multi-layer architectures should not be the default choice, unless a side-by-side comparison to simpler architectures shows otherwise. More generally, we hope that these results lead to more interest in carefully studying the architectural choices, and associated trade-offs, for training generalizable and robust policies.
Certified Defenses for Data Poisoning Attacks
Machine learning systems trained on user-provided data are susceptible to data poisoning attacks, whereby malicious users inject false training data with the aim of corrupting the learned model. While recent work has proposed a number of attacks and defenses, little is understood about the worst-case loss of a defense in the face of a determined attacker. We address this by constructing approximate upper bounds on the loss across a broad family of attacks, for defenders that first perform outlier removal followed by empirical risk minimization. Our approximation relies on two assumptions: (1) that the dataset is large enough for statistical concentration between train and test error to hold, and (2) that outliers within the clean (non-poisoned) data do not have a strong effect on the model. Our bound comes paired with a candidate attack that often nearly matches the upper bound, giving us a powerful tool for quickly assessing defenses on a given dataset. Empirically, we find that even under a simple defense, the MNIST-1-7 and Dogfish datasets are resilient to attack, while in contrast the IMDB sentiment dataset can be driven from 12% to 23% test error by adding only 3% poisoned data.
GP CaKe: Effective brain connectivity with causal kernels
A fundamental goal in network neuroscience is to understand how activity in one brain region drives activity elsewhere, a process referred to as effective connectivity. Here we propose to model this causal interaction using integro-differential equations and causal kernels that allow for a rich analysis of effective connectivity. The approach combines the tractability and flexibility of autoregressive modeling with the biophysical interpretability of dynamic causal modeling. The causal kernels are learned nonparametrically using Gaussian process regression, yielding an efficient framework for causal inference. We construct a novel class of causal covariance functions that enforce the desired properties of the causal kernels, an approach which we call GP CaKe. By construction, the model and its hyperparameters have biophysical meaning and are therefore easily interpretable. We demonstrate the efficacy of GP CaKe on a number of simulations and give an example of a realistic application on magnetoencephalography (MEG) data.