pilco
Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs
We present a data-efficient reinforcement learning method for continuous state-action systems under significant observation noise. Data-efficient solutions under small noise exist, such as PILCO, which learns the cartpole swing-up task in 30s. PILCO evaluates policies by planning state trajectories using a dynamics model. However, PILCO applies policies to the observed state, thereby planning in observation space. We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decision process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.
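The core idea, rolling the belief forward through a filter so the policy acts on the filtered estimate rather than the raw observation, can be sketched as follows. This is a minimal illustration assuming a linear-Gaussian observation model and an extended-Kalman-style update; `policy`, `dynamics_mean`, and `dynamics_jac` are hypothetical placeholders, not the paper's actual moment-matching machinery.

```python
import numpy as np

def filtered_step(mu, Sigma, policy, dynamics_mean, dynamics_jac, Q, R):
    """Propagate a Gaussian belief N(mu, Sigma) through one planning step.

    The policy acts on the filtered belief mean rather than on the raw
    noisy observation, which is the difference from unfiltered PILCO.
    """
    u = policy(mu)                            # act on the belief, not the observation
    mu_pred = dynamics_mean(mu, u)            # predictive mean of the next state
    A = dynamics_jac(mu, u)                   # local linearisation of the model
    S_pred = A @ Sigma @ A.T + Q              # predictive covariance (Q: process noise)

    # Measurement update for y = x + v, v ~ N(0, R). During planning the
    # expected innovation is zero, but the update still contracts the
    # covariance, which is what distinguishes belief-space planning.
    K = S_pred @ np.linalg.inv(S_pred + R)    # Kalman gain, identity observation map
    mu_new = mu_pred                          # zero expected innovation
    S_new = (np.eye(len(mu)) - K) @ S_pred
    return mu_new, S_new
```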
- Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.41)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Robots (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)
Reviews, Discussions, Author Feedback and Meta-Reviews
The proposed approach, while straightforward, quite elegantly handles the problem at hand. What prevents this paper from being a clear-cut acceptance is the lack of adequate experimental validation.

Typos: line 47: draw -> drawn.

A more thorough discussion of noise in the exploration step of Algorithm 1 (step 8) would be appreciated; this issue is also not discussed in the experiments section (how much noise was used?). I also had a few issues with some of the claimed advantages in the paper. Specifically: (1) the claim that PDDP has an advantage over PILCO because it does not have to solve non-convex optimization problems seems suspect, given the non-convexity of the optimization problem solved in the hyper-parameter tuning step.
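The non-convexity the reviewer points to is visible in the standard GP training objective itself: the log marginal likelihood has multiple local optima in the kernel hyperparameters, which is why it is usually optimised from several random restarts. A minimal sketch below illustrates this with an RBF kernel on toy data (illustrative, not from either paper).

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative log marginal likelihood of a 1-D GP with an RBF kernel."""
    ell, sf, sn = np.exp(log_params)          # lengthscale, signal std, noise std
    d2 = (X[:, None] - X[None, :]) ** 2
    K = sf**2 * np.exp(-0.5 * d2 / ell**2) + sn**2 * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return 1e10                           # penalise numerically bad regions
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)

# The objective is non-convex: different random starts can converge to
# different local optima (e.g. one that explains the data as pure noise),
# hence the usual multi-restart strategy.
results = [minimize(neg_log_marginal_likelihood, rng.standard_normal(3), args=(X, y))
           for _ in range(5)]
best = min(results, key=lambda r: r.fun)
print(np.exp(best.x))                         # (lengthscale, signal std, noise std)
```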
Author Feedback
We thank all reviewers for their constructive and helpful comments, which will allow us to better shape this paper. We will increase the plot sizes in the final version if the paper is accepted. Thank you also for raising the point about real-world experiments: we plan to take our approach to robotics in the future, and we believe self-driving cars present an ideal test-bed for our algorithm.
Probabilistic Differential Dynamic Programming
Yunpeng Pan, Evangelos Theodorou
We present a data-driven, probabilistic trajectory optimization framework for systems with unknown dynamics, called Probabilistic Differential Dynamic Programming (PDDP). PDDP explicitly accounts for uncertainty in the dynamics model using Gaussian processes (GPs). Based on a second-order local approximation of the value function, PDDP performs Dynamic Programming around a nominal trajectory in Gaussian belief spaces. Unlike typical gradient-based policy search methods, PDDP does not require a policy parameterization and learns a locally optimal, time-varying control policy. We demonstrate the effectiveness and efficiency of the proposed algorithm on two nontrivial tasks. Compared with classical DDP and a state-of-the-art GP-based policy search method, PDDP offers a superior combination of data-efficiency, learning speed, and applicability.
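For readers unfamiliar with DDP, the heart of such methods is a backward recursion on a locally quadratic value function, which directly yields a time-varying affine controller with no policy parameterization. Below is a generic iLQR-style sketch of that recursion; it is our illustration of the general technique, not PDDP's actual GP-based derivatives.

```python
import numpy as np

def ddp_backward_pass(fx, fu, lx, lu, lxx, luu, lux):
    """Backward sweep over a length-T nominal trajectory.

    fx, fu, lu, luu, lux are length-T lists of dynamics/cost derivatives;
    lx and lxx have T+1 entries, the last being the terminal cost expansion.
    Returns feedforward terms k[t] and feedback gains K[t] defining the
    time-varying controller u[t] = u_nominal[t] + k[t] + K[t] @ dx[t].
    """
    T = len(fx)
    Vx, Vxx = lx[-1], lxx[-1]                 # terminal value expansion
    k, K = [None] * T, [None] * T
    for t in reversed(range(T)):
        Qx  = lx[t]  + fx[t].T @ Vx
        Qu  = lu[t]  + fu[t].T @ Vx
        Qxx = lxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = luu[t] + fu[t].T @ Vxx @ fu[t]
        Qux = lux[t] + fu[t].T @ Vxx @ fx[t]
        Quu_inv = np.linalg.inv(Quu)          # assumes Quu > 0; regularise otherwise
        k[t] = -Quu_inv @ Qu                  # feedforward correction
        K[t] = -Quu_inv @ Qux                 # time-varying feedback gain
        Vx  = Qx  + K[t].T @ Quu @ k[t] + K[t].T @ Qu + Qux.T @ k[t]
        Vxx = Qxx + K[t].T @ Quu @ K[t] + K[t].T @ Qux + Qux.T @ K[t]
    return k, K
```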
Reviews: Total stochastic gradient algorithms and applications in reinforcement learning
This paper provides another formalism for gradient estimation in probabilistic computation graphs. Using pathwise derivative and likelihood-ratio estimators, existing and well-known policy gradient theorems are cast into the proposed formalism. This intuition is then used to propose two new methods for gradient estimation that can be used in a model-based RL framework. Experiments demonstrate performance comparable to PILCO on the cart-pole task. Quality: the idea in this work is interesting, and the proposed framework and methods may prove useful in RL settings.
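As a concrete illustration of the two estimator families the review mentions, consider estimating d/dtheta E_{x~N(theta,1)}[x^2], whose true value is 2*theta, both ways. This toy example is ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000

# Pathwise derivative (reparameterisation): write x = theta + eps with
# eps ~ N(0, 1), then d(x^2)/dtheta = 2x.
x = theta + rng.standard_normal(n)
pathwise = (2 * x).mean()

# Likelihood ratio (score function): grad_theta log N(x; theta, 1) = x - theta,
# so the estimator is x^2 * (x - theta).
x = theta + rng.standard_normal(n)
score = (x**2 * (x - theta)).mean()

print(pathwise, score)  # both approximate 2*theta = 3.0; pathwise has lower variance
```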
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)