Randomized Policy Learning for Continuous State and Action MDPs

Hiteshi Sharma, Rahul Jain

arXiv.org Artificial Intelligence 

Recently, reinforcement learning (RL) algorithms based on actor-critic architectures [9] or policy optimization [16] have shown remarkably good performance on continuous control tasks. The policy and the value function are represented by deep neural networks, whose weights are updated accordingly. However, [7] shows that the performance of these RL algorithms varies considerably with changes in hyperparameters, network architecture, etc. Furthermore, [10] showed that a simple linear policy whose weights are updated by a random search method can outperform some of these state-of-the-art results. A key question is how far we can go by relying almost exclusively on these architectural biases. For Markov decision processes (MDPs) with discrete state and action spaces, model-based algorithms based on dynamic programming (DP) ideas [13] can be used when the model is known. Unfortunately, in many problems (e.g., robotics), the system model is unknown, or simply too complicated to be stated succinctly and used in DP algorithms; the latter is usually the more likely case.
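As a concrete illustration of the linear-policy baseline referenced in [10], the sketch below applies a basic random-search update to a linear policy a = M s using episodic returns only, with no gradients. This is not the paper's algorithm; the Gymnasium API, the environment name (Pendulum-v1), and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's method): linear policy a = M @ s,
# weights updated by finite-difference random search on episodic returns.
import numpy as np
import gymnasium as gym

def rollout(env, M, horizon=1000):
    """Total reward of one episode under the linear policy a = M @ s."""
    s, _ = env.reset()
    total = 0.0
    for _ in range(horizon):
        a = M @ s
        s, r, terminated, truncated, _ = env.step(a)
        total += r
        if terminated or truncated:
            break
    return total

def random_search(env_name="Pendulum-v1", iters=100, n_dirs=8, step=0.02, noise=0.03):
    # env_name and hyperparameters are illustrative, not from the paper.
    env = gym.make(env_name)
    d_a, d_s = env.action_space.shape[0], env.observation_space.shape[0]
    M = np.zeros((d_a, d_s))
    for _ in range(iters):
        # Sample random perturbation directions; evaluate +/- perturbed policies.
        deltas = [np.random.randn(d_a, d_s) for _ in range(n_dirs)]
        r_plus = [rollout(env, M + noise * D) for D in deltas]
        r_minus = [rollout(env, M - noise * D) for D in deltas]
        # Move weights along directions weighted by return differences.
        update = sum((rp - rm) * D for rp, rm, D in zip(r_plus, r_minus, deltas))
        M += (step / n_dirs) * update
    return M
```

The point of the sketch is only that such a derivative-free update of a linear policy is simple to state and implement, which is what makes the comparison in [10] striking.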
