Collaborating Authors

pilco


Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs

Neural Information Processing Systems

We present a data-efficient reinforcement learning method for continuous state-action systems under significant observation noise. Data-efficient solutions under small noise exist, such as PILCO, which learns the cartpole swing-up task in 30 s. PILCO evaluates policies by planning state trajectories using a dynamics model. However, PILCO applies policies to the observed state, and therefore plans in observation space. We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decision process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as the post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.
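To make the belief-space idea concrete, here is a minimal, hypothetical sketch (our own illustration, not the paper's algorithm, which uses Gaussian-process dynamics and moment matching): a single Kalman-style predict-and-update step on an assumed linear-Gaussian system, producing the belief (mean and covariance) that a policy would act on instead of the raw noisy observation.

```python
import numpy as np

def belief_update(mu, Sigma, u, y, A, B, Q, C, R):
    """One predict + filter step for a linear-Gaussian system
    x' = A x + B u + w,  w ~ N(0, Q);  y = C x' + v,  v ~ N(0, R).
    Returns the updated belief (mean, covariance) over x'."""
    # Predict: propagate the current belief through the (assumed) dynamics.
    mu_pred = A @ mu + B @ u
    S_pred = A @ Sigma @ A.T + Q
    # Update: fold in the noisy observation y via the Kalman gain.
    K = S_pred @ C.T @ np.linalg.inv(C @ S_pred @ C.T + R)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    S_new = (np.eye(len(mu)) - K @ C) @ S_pred
    return mu_new, S_new
```

A belief-space policy would then be a function of `(mu_new, S_new)` rather than of `y`, which is what distinguishes POMDP-style planning from acting on raw observations.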








Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance. The proposed approach, while straightforward, quite elegantly handles the problem at hand. What prevents this paper from being a clear-cut acceptance is the lack of adequate experimental validation. Typos: line 47, "draw" -> "drawn". A more thorough discussion of noise in the exploration step of Algorithm 1 (step 8) would be appreciated. This issue is also not discussed in the experiments section (how much noise was used?). I also had a few issues with some of the claimed advantages in the paper. Specifically: (1) the claim that PDDP has an advantage over PILCO because it does not have to solve non-convex optimization problems seems suspect, given the non-convexity of the optimization problem solved in the hyper-parameter tuning step.


677e09724f0e2df9b6c000b75b5da10d-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all reviewers for their constructive and helpful comments, which will allow us to better shape this paper. We are very thankful for the review and will definitely increase our plot sizes in the final version in case of acceptance. Also, thank you for raising the point about real-world experiments. In fact, we plan to take our approach to robotics in the future. We believe self-driving cars present an ideal test-bed for our algorithm.


Probabilistic Differential Dynamic Programming

Yunpeng Pan, Evangelos Theodorou

Neural Information Processing Systems

We present a data-driven, probabilistic trajectory optimization framework for systems with unknown dynamics, called Probabilistic Differential Dynamic Programming (PDDP). PDDP explicitly accounts for uncertainty in the dynamics model using Gaussian processes (GPs). Based on a second-order local approximation of the value function, PDDP performs dynamic programming around a nominal trajectory in Gaussian belief spaces. Unlike typical gradient-based policy search methods, PDDP does not require a policy parameterization and learns a locally optimal, time-varying control policy. We demonstrate the effectiveness and efficiency of the proposed algorithm on two nontrivial tasks. Compared with classical DDP and a state-of-the-art GP-based policy search method, PDDP offers a superior combination of data-efficiency, learning speed, and applicability.
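For intuition about the dynamic-programming machinery that DDP-style methods such as PDDP build on, here is a hedged sketch (our own illustration, assuming known linear dynamics and quadratic cost rather than PDDP's GP-based belief dynamics): the classical LQR backward pass, which propagates a quadratic value function backwards in time to produce exactly the kind of time-varying feedback policy the abstract describes.

```python
import numpy as np

def lqr_backward(A, B, Q, R, Qf, T):
    """Riccati recursion for x' = A x + B u with stage cost x'Qx + u'Ru
    and terminal cost x'Qf x. Returns time-varying feedback gains so that
    u_t = -gains[t] @ x_t is the optimal control."""
    P = Qf          # Hessian of the value function at the horizon
    gains = []
    for _ in range(T):
        # Minimize the local quadratic model of cost-to-go over u.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Propagate the quadratic value function one step backwards.
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # gains[t] is the gain to apply at time t
```

PDDP replaces the known linear model here with GP predictions over a Gaussian belief, but the backward propagation of a locally quadratic value function is the shared core idea.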


Reviews: Total stochastic gradient algorithms and applications in reinforcement learning

Neural Information Processing Systems

This paper provides another formalism for gradient estimation in probabilistic computation graphs. Using pathwise derivative and likelihood-ratio estimators, existing and well-known policy gradient theorems are cast into the proposed formalism. This intuition is then used to propose two new methods for gradient estimation that can be used in a model-based RL framework. Experiments demonstrate results comparable to PILCO on the cart-pole task. Quality: the idea in this work is interesting, and the proposed framework and methods may prove useful in RL settings.
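As a concrete toy illustration of the two estimator families the review mentions (our own example, not from the paper): both the pathwise (reparameterization) and the likelihood-ratio (score-function) estimator recover d/dmu E[x^2] = 2*mu for x ~ N(mu, sigma^2), but typically with very different variance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 0.5, 200_000
eps = rng.standard_normal(n)
x = mu + sigma * eps  # reparameterized samples x ~ N(mu, sigma^2)

# Pathwise (reparameterization) estimator: differentiate f(x) = x^2
# through the sample path x = mu + sigma*eps, giving df/dmu = 2*x.
pathwise = np.mean(2 * x)

# Likelihood-ratio (score-function / REINFORCE) estimator:
# grad = E[f(x) * d/dmu log N(x; mu, sigma^2)] = E[x^2 * (x - mu) / sigma^2].
likelihood_ratio = np.mean(x**2 * (x - mu) / sigma**2)

# Both estimates should be close to the true gradient 2*mu = 3.0,
# with the likelihood-ratio estimate noticeably noisier.
```

The variance gap is why model-based methods like PILCO, which can differentiate through the model, favour pathwise gradients when they are available.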