Finding the Near Optimal Policy via Adaptive Reduced Regularization in MDPs

Wenhao Yang, Xiang Li, Guangzeng Xie, Zhihua Zhang

arXiv.org Machine Learning 

Reinforcement learning (RL) has achieved great empirical success, especially when the policy and value function are parameterized by neural networks. Many studies [16, 21, 24, 11] have demonstrated striking performance of RL, matching or even surpassing human-level performance. Dynamic programming [19, 20, 10, 3] and policy gradient methods [31, 26, 13] are the most frequently used optimization tools in these studies. However, when policy gradient methods are applied, theoretical understanding of the success of RL remains limited, whether the policy is searched over the probability simplex or over a parameterized space. One line of recent work [6, 1, 5] studies the convergence of policy gradient methods for MDPs without parameterization, while another line [15, 7, 30, 8] focuses on MDPs with parameterization. In addition, during the process of learning MDPs, it is often observed that the obtained policy becomes nearly deterministic before the environment has been fully explored. Some prior works [2, 17, 9, 28] propose adding the Shannon entropy of the policy to each reward so that the learned policy remains stochastic, allowing the agent to keep exploring the environment rather than getting trapped in a small region of the state space. Intuitively and empirically, entropy regularization smooths the learning process and encourages exploration, and thus may accelerate convergence.
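As a brief sketch of what this entropy regularization looks like (the notation below is ours, not necessarily that of the paper), the regularized objective augments each per-step reward with the Shannon entropy of the policy, weighted by a coefficient $\tau > 0$:

$$
V_\tau^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \big( r(s_t, a_t) + \tau\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \,\Big|\, s_0 = s \Big],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s).
$$

Setting $\tau = 0$ recovers the unregularized value function; the "adaptive reduced regularization" of the title presumably refers to shrinking $\tau$ over the course of learning so that the solution of the regularized problem approaches a near optimal policy of the original MDP.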
