Collaborating Authors

 Curi, Sebastian


Gradient-Based Trajectory Optimization With Learned Dynamics

arXiv.org Artificial Intelligence

Trajectory optimization methods have achieved an exceptional level of performance on real-world robots in recent years. These methods heavily rely on accurate analytical models of the dynamics, yet some aspects of the physical world can only be captured to a limited extent. An alternative approach is to leverage machine learning techniques to learn a differentiable dynamics model of the system from data. In this work, we use trajectory optimization and model learning to perform highly dynamic and complex tasks with robotic systems in the absence of accurate analytical models of the dynamics. We show that a neural network can model highly nonlinear behaviors accurately over large time horizons, from data collected in only 25 minutes of interactions on two distinct robots: (i) the Boston Dynamics Spot and (ii) a radio-controlled (RC) car. Furthermore, we use the gradients of the neural network to perform gradient-based trajectory optimization. In our hardware experiments, we demonstrate that our learned model can represent complex dynamics for both the Spot and the RC car, and gives good performance in combination with trajectory optimization methods.
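As a rough sketch of this pipeline (not the paper's implementation), the snippet below unrolls a small learned dynamics network over a horizon and optimizes an action sequence by gradient descent through the model; the network architecture, cost function, and dimensions are placeholders.

    import torch
    import torch.nn as nn

    # Hypothetical learned dynamics model: predicts the next state from (state, action).
    dynamics = nn.Sequential(nn.Linear(4 + 2, 64), nn.Tanh(), nn.Linear(64, 4))

    def rollout_cost(x0, actions, goal):
        """Unroll the learned model over the horizon and accumulate a cost."""
        x, cost = x0, 0.0
        for a in actions:                      # actions: (horizon, action_dim)
            x = dynamics(torch.cat([x, a]))    # one-step prediction
            cost = cost + torch.sum((x - goal) ** 2) + 1e-3 * torch.sum(a ** 2)
        return cost

    x0 = torch.zeros(4)
    goal = torch.ones(4)
    actions = torch.zeros(20, 2, requires_grad=True)  # the decision variables
    opt = torch.optim.Adam([actions], lr=0.05)

    for _ in range(200):        # gradient-based trajectory optimization
        opt.zero_grad()
        loss = rollout_cost(x0, actions, goal)
        loss.backward()         # gradients flow through the learned model
        opt.step()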


Get Back Here: Robust Imitation by Return-to-Distribution Planning

arXiv.org Artificial Intelligence

Imitation Learning (IL) is a paradigm in sequential decision making where an agent uses offline expert trajectories to mimic the expert's behavior [1]. While Reinforcement Learning (RL) requires an additional reward signal that can be hard to specify in practice, IL only requires expert trajectories, which can be easier to collect. In part due to its simplicity, IL has been applied successfully in several real-world tasks, from robotic manipulation [2, 3, 4] to autonomous driving [5, 6]. A key challenge in deploying IL, however, is that the agent may encounter states in the final deployment environment that were not labeled by the expert offline [7]. In applications such as healthcare [8, 9] and robotics [10, 11], online experimentation can be risky (e.g., on human patients) or costly to label (e.g., off-policy robotic datasets can take months to collect).
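A minimal sketch of the IL paradigm described here, plain behavior cloning rather than the paper's return-to-distribution method; the expert dataset, network, and dimensions are made up.

    import torch
    import torch.nn as nn

    # Placeholder expert trajectories: states and the actions the expert took.
    expert_states = torch.randn(1000, 8)
    expert_actions = torch.randn(1000, 2)

    policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

    for _ in range(500):  # behavior cloning: regress actions onto expert states
        opt.zero_grad()
        loss = nn.functional.mse_loss(policy(expert_states), expert_actions)
        loss.backward()
        opt.step()
    # At deployment, states outside the expert data can still cause compounding
    # drift -- the covariate-shift problem the paragraph above describes.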


Constrained Policy Optimization via Bayesian World Models

arXiv.org Artificial Intelligence

Improving sample efficiency and safety are crucial challenges when deploying reinforcement learning in high-stakes real-world applications. We propose LAMBDA, a novel model-based approach for policy optimization in safety-critical tasks modeled via constrained Markov decision processes. Our approach utilizes Bayesian world models and harnesses the resulting uncertainty to maximize optimistic upper bounds on the task objective while respecting pessimistic upper bounds on the safety constraints. We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
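The central mechanism, optimistic estimates of the objective and pessimistic estimates of the safety cost derived from posterior samples, can be caricatured in a few lines; the numbers and the ensemble-of-world-models reading are assumptions for illustration.

    import numpy as np

    # Hypothetical returns/costs of one policy under dynamics models sampled
    # from a Bayesian posterior (e.g., an ensemble of world models).
    sampled_returns = np.array([3.1, 2.7, 3.5, 2.9, 3.3])
    sampled_costs = np.array([0.4, 0.9, 0.5, 0.7, 0.6])

    optimistic_return = sampled_returns.max()  # upper bound on the objective
    pessimistic_cost = sampled_costs.max()     # upper bound on the safety cost

    budget = 0.8  # CMDP constraint threshold (made up)
    feasible = pessimistic_cost <= budget      # only keep policies deemed safe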


Combining Pessimism with Optimism for Robust and Efficient Model-Based Deep Reinforcement Learning

arXiv.org Artificial Intelligence

In real-world tasks, reinforcement learning (RL) agents frequently encounter situations that are not present during training time. To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations. The robust RL framework addresses this challenge via a worst-case optimization between an agent and an adversary. Previous robust RL algorithms are either sample inefficient, lack robustness guarantees, or do not scale to large problems. We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem while attaining near-optimal sample complexity guarantees. RH-UCRL is a model-based reinforcement learning (MBRL) algorithm that effectively distinguishes between epistemic and aleatoric uncertainty and efficiently explores both the agent and adversary decision spaces during policy learning. We scale RH-UCRL to complex tasks via neural network ensemble models as well as neural network policies. Experimentally, we demonstrate that RH-UCRL outperforms other robust deep RL algorithms in a variety of adversarial environments.
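A one-step caricature of the robust objective (not RH-UCRL itself): the agent maximizes while the adversary minimizes, with the agent optimistic about a hypothetical epistemic-uncertainty term so that poorly known regions are still explored.

    import numpy as np

    agent_actions = np.linspace(-1.0, 1.0, 21)
    adversary_actions = np.linspace(-0.5, 0.5, 11)
    beta = 1.0  # confidence parameter

    def mean_return(a, w):    # hypothetical model-mean return
        return -(a - 0.3) ** 2 - a * w

    def epistemic_std(a, w):  # hypothetical epistemic uncertainty
        return 0.1 + 0.2 * abs(a * w)

    # max-min over an optimistic value estimate: the agent gets the benefit of
    # the doubt from epistemic uncertainty, the adversary plays worst case.
    robust_value = max(
        min(mean_return(a, w) + beta * epistemic_std(a, w)
            for w in adversary_actions)
        for a in agent_actions
    )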


Logistic Q-Learning

arXiv.org Artificial Intelligence

We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method (called Q-REPS) introduces a Q-function that enables an efficient, exact, model-free implementation. While REPS is elegantly derived from a principled linear-programming (LP) formulation of optimal control in MDPs, it has the serious shortcoming that its faithful implementation requires access to the true MDP for both the policy evaluation and improvement steps, even at deployment time. The usual way to address this limitation is to use an empirical approximation to the policy evaluation step and to project the policy from the improvement step into a parametric space (Deisenroth et al., 2013), losing all the theoretical guarantees of REPS in the process.
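For reference, the LP formulation of optimal control that both methods start from can be written in a standard discounted form (our notation, a sketch rather than the paper's exact statement):

    \max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
    \quad \text{s.t.} \quad
    \sum_{a'} d(s',a') = (1-\gamma)\,\nu_0(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, d(s,a) \quad \forall s',

where d is the discounted state-action occupancy measure, \nu_0 the initial-state distribution, and P the transition kernel. Regularizing such an LP with relative-entropy terms is what, per the abstract above, introduces the Q-function and the exact model-free updates of Q-REPS.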


Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

arXiv.org Machine Learning

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for learning the model, they ignore this distinction when optimizing the policy. In this paper, we show that ignoring the epistemic uncertainty leads to greedy algorithms that do not explore sufficiently. In turn, we propose a practical optimistic-exploration algorithm (H-UCRL), which enlarges the input space with hallucinated inputs that can exert as much control as the epistemic uncertainty in the model affords. We analyze this setting and construct a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian Process models. Based on this theoretical foundation, we show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.
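A minimal sketch of the hallucinated-inputs idea under stated assumptions: the model and dimensions are hypothetical, and the extra input eta in [-1, 1]^d lets a policy steer the prediction anywhere inside the model's epistemic confidence interval.

    import numpy as np

    beta = 2.0  # confidence-interval scaling

    def hallucinated_step(mean_fn, std_fn, state, action, eta):
        """One-step hallucinated dynamics: eta in [-1, 1]^d selects a
        plausible next state within the epistemic confidence set."""
        mu = mean_fn(state, action)    # model's mean prediction
        sigma = std_fn(state, action)  # epistemic std (e.g., ensemble spread)
        return mu + beta * sigma * np.clip(eta, -1.0, 1.0)

    # Hypothetical toy model of a 2-D system.
    mean_fn = lambda s, a: s + 0.1 * a
    std_fn = lambda s, a: 0.05 * np.ones_like(s)

    s, a = np.zeros(2), np.array([1.0, 0.0])
    eta = np.array([1.0, -1.0])  # chosen by the hallucination policy
    s_next = hallucinated_step(mean_fn, std_fn, s, a, eta)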


Adaptive Sampling for Stochastic Risk-Averse Learning

arXiv.org Machine Learning

We consider the problem of training machine learning models in a risk-averse manner. In particular, we propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players. Our approach solves the game using an efficient no-regret algorithm for each player. Critically, we can apply these algorithms to large-scale settings because the implementation relies on sampling from Determinantal Point Processes. Finally, we empirically demonstrate its effectiveness on large-scale convex and non-convex learning tasks.
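For concreteness, the quantity being optimized has a simple empirical form; the sketch below computes CVaR as the mean of the worst alpha-fraction of losses (equivalently, the worst-case expectation over reweightings q with q_i <= 1/(alpha n), the distributionally robust view mentioned above), ignoring the fractional boundary term.

    import numpy as np

    def empirical_cvar(losses, alpha):
        """CVaR_alpha: mean of the worst ceil(alpha * n) losses."""
        losses = np.sort(losses)[::-1]  # largest losses first
        k = max(1, int(np.ceil(alpha * len(losses))))
        return losses[:k].mean()

    losses = np.random.randn(1000) ** 2
    print(empirical_cvar(losses, alpha=0.1))  # mean of the worst 10%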


Structured Variational Inference in Unstable Gaussian Process State Space Models

arXiv.org Machine Learning

Gaussian processes are expressive, non-parametric statistical models that are well suited to learning nonlinear dynamical systems. However, large-scale inference in these state-space models is a challenging problem. In this paper, we propose CBF-SSM, a scalable model that employs a structured variational approximation to maintain temporal correlations. In contrast to prior work, our approach applies to the important class of unstable systems, where state uncertainty grows unbounded over time. For these systems, our method contains a probabilistic, model-based backward pass that infers latent states during training. We demonstrate state-of-the-art performance in our experiments. Moreover, we show that CBF-SSM can be combined with physical models in the form of ordinary differential equations to learn a reliable model of a physical flying robotic vehicle.
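To see why the unstable case is hard, consider the scalar linear-Gaussian system x_{t+1} = a x_t + w_t with |a| > 1 (a toy illustration, not the paper's model): the forward-propagated state variance grows geometrically, which is the regime the backward pass is meant to handle by conditioning on observations.

    # Variance recursion of x_{t+1} = a * x_t + w_t with w_t ~ N(0, q).
    a, q = 1.2, 0.1   # |a| > 1: unstable system
    var = 0.0
    for t in range(30):
        var = a * a * var + q  # grows like a**(2t); converges only if |a| < 1
    print(var)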


Safe Contextual Bayesian Optimization for Sustainable Room Temperature PID Control Tuning

arXiv.org Machine Learning

We tune one of the most common heating, ventilation, and air conditioning (HVAC) control loops, namely the temperature control of a room. For economic and environmental reasons, it is of prime importance to optimize the performance of this system. Buildings account for 20 to 40% of a country's energy consumption, and almost 50% of that comes from HVAC systems. Scenario projections predict a 30% decrease in heating consumption by 2050 due to efficiency increases. Advanced control techniques can improve performance; however, proportional-integral-derivative (PID) control is typically used due to its simplicity and overall performance. We use Safe Contextual Bayesian Optimization to optimize the PID parameters without human intervention. We reduce costs by 32% compared to the current PID controller settings while assuring safety and comfort for people in the room. The results of this work have an immediate impact on the room control loop's performance and its related commissioning costs. Furthermore, this successful attempt paves the way for further use at different levels of HVAC systems, with promising energy, operational, and commissioning cost savings, and it is a practical demonstration of the positive effects that Artificial Intelligence can have on environmental sustainability.
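For context, the controller being tuned and the outer tuning loop look roughly like this; the PID class is standard, while the safe-BO step is only a schematic placeholder (safe_bo, evaluate, and the context variable are hypothetical names, with the real acquisition following the cited Safe Contextual Bayesian Optimization method).

    class PID:
        """Discrete-time PID controller, e.g., for room temperature."""
        def __init__(self, kp, ki, kd, dt):
            self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
            self.integral, self.prev_err = 0.0, 0.0

        def step(self, setpoint, measurement):
            err = setpoint - measurement
            self.integral += err * self.dt
            deriv = (err - self.prev_err) / self.dt
            self.prev_err = err
            return self.kp * err + self.ki * self.integral + self.kd * deriv

    # Schematic outer loop: propose PID gains only from the region the safety
    # model currently certifies as comfortable, then update the model.
    # for t in range(budget):
    #     kp, ki, kd = safe_bo.propose(context=outdoor_temperature)
    #     cost, comfort_violation = evaluate(PID(kp, ki, kd, dt=60.0))
    #     safe_bo.update((kp, ki, kd), cost, comfort_violation)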


Online Variance Reduction with Mixtures

arXiv.org Machine Learning

Adaptive importance sampling for stochastic optimization is a promising approach that offers improved convergence through variance reduction. In this work, we propose a new framework for variance reduction that enables the use of mixtures over predefined sampling distributions, which can naturally encode prior knowledge about the data. While these sampling distributions are fixed, the mixture weights are adapted during the optimization process. We propose VRM, a novel and efficient adaptive scheme that asymptotically recovers the best mixture weights in hindsight and can also accommodate sampling distributions over sets of points. We empirically demonstrate the versatility of VRM in a range of applications.
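A hedged sketch of the mixture structure under stated assumptions (the update rule below is a simplified exponentiated-gradient step, not VRM's actual no-regret scheme, and the per-point gradient norms are made up): fixed component distributions, adaptive weights, and importance weighting to keep estimates unbiased.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    # Two fixed, predefined sampling distributions over n data points:
    # uniform, and one biased toward later points (encoding prior knowledge).
    p1 = np.full(n, 1.0 / n)
    p2 = np.arange(1, n + 1, dtype=float)
    p2 /= p2.sum()
    components = np.stack([p1, p2])
    w = np.array([0.5, 0.5])               # adaptive mixture weights

    grad_norms = np.linspace(0.1, 2.0, n)  # stand-in per-point gradient norms

    for t in range(2000):
        q = w @ components                 # current mixture distribution
        i = rng.choice(n, p=q)
        g2 = grad_norms[i] ** 2
        # Unbiased estimate of d/dw_k of the variance proxy sum_i g_i^2 / q_i.
        grad_w = -g2 * components[:, i] / q[i] ** 3
        w *= np.exp(-1e-5 * grad_w)        # tiny step: estimates are heavy-tailed
        w /= w.sum()
    # w now favors the component that samples high-gradient points more often.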