AITopics

1901.10634

Country:

North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine Learning, Jan-29-2019

Trust Region-Guided Proximal Policy Optimization

Wang, Yuhui, He, Hao, Tan, Xiaoyang, Gan, Yaozhong

Model-free reinforcement learning relies heavily on a safe yet exploratory policy search. Proximal policy optimization (PPO) is a prominent algorithm that addresses the safe search problem by exploiting a heuristic clipping mechanism motivated by theoretically justified "trust region" guidance. However, we found that the clipping mechanism of PPO can lead to a lack of exploration. Based on this finding, we improve the original PPO with an adaptive clipping mechanism guided by a "trust region" criterion. Our method, termed Trust Region-Guided PPO (TRPPO), improves PPO with more exploration and better sample efficiency, while maintaining the safe search property and design simplicity of PPO. On several benchmark tasks, TRPPO significantly outperforms the original PPO and is competitive with several state-of-the-art methods.
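For readers unfamiliar with the clipping mechanism the abstract refers to, the following is a minimal sketch of the standard PPO clipped surrogate with a fixed clipping range. It is not the adaptive, trust-region-guided rule proposed in the paper. The function name `ppo_clip_objective` and the default `epsilon=0.2` are assumptions chosen for illustration.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate of standard PPO (fixed clipping range).

    ratio     : pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage : estimated advantage of the sampled action
    epsilon   : clipping range (assumed default, not taken from the abstract)
    """
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio
    # outside the interval in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + epsilon, so a fixed range caps how far the probability of any action can be raised in one update, whatever its current probability. The abstract's adaptive clipping is aimed at this kind of restriction.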

ppo, st kl, trppo, (15 more...)

1901.10314

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.70)

Xu, Tianbing, Zhang, Andrew, Zhao, Liang

WALL-E: An Efficient Reinforcement Learning Research Framework

arXiv.org Machine Learning, Jan-28-2019

Overall, reinforcement learning (RL) involves an agent interacting with an environment by repeatedly running a policy π, collecting experience from each iteration, and using that experience to update its policy for maximal reward (Fig. 1). [Figure 1: RL flow chart] Thanks to advancements in big data, computing power, and other machine learning disciplines, reinforcement learning has emerged as the pinnacle field in pushing humanity closer to true artificial intelligence. Model-based reinforcement learning, for example, aims to build an accurate model (such as an MDP) of the environment dynamics and train the agent on said model, giving model learning capabilities as well as ease of reward learning. On the other hand, in model-free reinforcement learning, the agent does not have explicit information regarding state transitions and must continuously explore and generate experience to find the optimal policy. In recent years, major problems have arisen in the field of reinforcement learning, such as planning and how to balance exploration and exploitation. Of particular interest, however, is the problem of knowledge gathering, namely how to efficiently and quickly sample trajectories to gain experience and update the policy without adversely affecting average return.
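The run-policy, collect-experience, update-policy loop described above can be illustrated with a toy example. This is a minimal sketch on a hypothetical two-armed bandit with an epsilon-greedy policy; it is not the WALL-E framework, and the names `run_bandit` and `arm_rewards` as well as all numeric values are assumptions made for illustration.

```python
import random


def run_bandit(num_iterations=500, epsilon=0.1, seed=0):
    """Toy RL loop: act with the current policy, observe a reward, update."""
    rng = random.Random(seed)
    arm_rewards = [0.2, 1.0]        # hypothetical environment: fixed reward per arm
    value_estimates = [0.0, 0.0]    # the agent's running estimate for each arm
    pull_counts = [0, 0]

    for _ in range(num_iterations):
        # Run the policy: explore with probability epsilon, otherwise be greedy.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: value_estimates[a])

        # Collect experience from the environment.
        reward = arm_rewards[action]

        # Update: incremental mean of the rewards observed for this arm.
        pull_counts[action] += 1
        value_estimates[action] += (reward - value_estimates[action]) / pull_counts[action]

    return value_estimates
```

The greedy branch alone would stay on the first arm indefinitely; the occasional random action is what lets the agent discover the better arm, which is the exploration and exploitation trade-off the passage mentions.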

efficient reinforcement learning research framework, experience collection time, reinforcement learning research framework, (10 more...)

1901.06086

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Melnik, Andrew, Fleer, Sascha, Schilling, Malte, Ritter, Helge

Modularization of End-to-End Learning: Case Study in Arcade Games

Complex environments and tasks pose a difficult problem for holistic end-to-end learning approaches. Decomposition of an environment into interacting controllable and non-controllable objects allows supervised learning for non-controllable objects and universal value function approximator learning for controllable objects. Such decomposition should lead to a shorter learning time and better generalization capability. Here, we consider arcade-game environments as sets of interacting objects (controllable, non-controllable) and propose a set of functional modules that are specialized in mastering different types of interactions in a broad range of environments. The modules utilize regression, supervised learning, and reinforcement learning algorithms. Results of this case study in different Atari games suggest that human-level performance can be achieved by a learning agent within a human amount of game experience (10-15 minutes of game time) when a proper decomposition of an environment or a task is provided. However, automating such decomposition remains a challenging problem. This case study shows how a model of a causal structure underlying an environment or a task can benefit learning time and generalization capability of the agent, and argues in favor of exploiting modular structure in contrast to using pure end-to-end learning approaches.

functional module, interaction primitive, trajectory, (11 more...)

1901.09895

Country:

Europe > Germany (0.06)
North America > United States > Massachusetts (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.72)

Gelada, Carles, Bellemare, Marc G.

Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.'s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.

function approximation, learning, operator, (14 more...)

1901.09455

Country: North America > United States > New Jersey > Mercer County > Princeton (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.40)

Reward Shaping via Meta-Learning

Zou, Haosheng, Ren, Tongzheng, Yan, Dong, Su, Hang, Zhu, Jun

Reward shaping is one of the most effective methods to tackle the crucial yet challenging problem of credit assignment in Reinforcement Learning (RL). However, designing shaping functions usually requires much expert knowledge and hand-engineering, and the difficulties are further exacerbated given multiple similar tasks to solve. In this paper, we consider reward shaping on a distribution of tasks, and propose a general meta-learning framework to automatically learn efficient reward shaping on newly sampled tasks, assuming only a shared state space but not necessarily a shared action space. We first derive the theoretically optimal reward shaping in terms of credit assignment in model-free RL. We then propose a value-based meta-learning algorithm to extract an effective prior over the optimal reward shaping. The prior can be applied directly to new tasks, or provably adapted to the task-posterior while solving the task within a few gradient updates. We demonstrate the effectiveness of our shaping through significantly improved learning efficiency and interpretable visualizations across various settings, including notably a successful transfer from DQN to DDPG.
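As background on what a shaping function does, here is a minimal sketch of the classic hand-designed, potential-based form of reward shaping, in which a term gamma * phi(s') - phi(s) is added to the environment reward. This is the kind of manual design the abstract wants to avoid, not the meta-learned shaping the paper proposes. The helper names, the one-dimensional integer state, and the goal position are hypothetical.

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Add the potential-based shaping term gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


def neg_distance_to_goal(state, goal=10):
    """Hypothetical hand-designed potential: higher when closer to the goal."""
    return -abs(goal - state)


# Moving from state 3 to state 4 brings the agent one step closer to the
# goal, so the shaping term is positive even though the raw reward is zero.
bonus = shaped_reward(0.0, 3, 4, neg_distance_to_goal, gamma=1.0)
```

Here the potential encodes expert knowledge about the task (distance to a known goal), which illustrates why hand-engineering such functions becomes costly across many similar tasks.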

credit assignment, eqn, reward shaping, (15 more...)

1901.09330

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games > Computer Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

Qu, Chao, Mannor, Shie, Xu, Huan, Qi, Yuan, Song, Le, Xiong, Junwu

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume that the reward function for each agent can be different and observed only locally by the agent itself. Furthermore, each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency and a decentralized optimization method, we obtain a principled and data-efficient iterative algorithm. In the first step of each iteration, an agent computes its local policy and value gradients and then updates only policy parameters. In the second step, the agent propagates to its neighbors the messages based on its value function and then updates its own value function. Hence we name the algorithm value propagation. We prove a non-asymptotic convergence rate of 1/T with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, and nonlinear function approximation setting. We empirically demonstrate the effectiveness of our approach in experiments.

agent, propagation, value propagation, (12 more...)

1901.09326

Country: North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.64)

Industry: Energy > Power Industry (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.67)

#artificialintelligence, Jan-26-2019, 21:18:21 GMT

AI Helps Amputees Walk With a Robotic Knee

A movie montage for modern artificial intelligence might show a computer playing millions of games of chess or Go against itself to learn how to win. Now, researchers are exploring how the reinforcement learning technique that helped DeepMind's AlphaZero conquer chess and Go could tackle an even more complex task: training a robotic knee to help amputees walk smoothly.

large language model, machine learning, reinforcement learning, (4 more...)

#artificialintelligence

Industry: Leisure & Entertainment > Games > Chess (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

#artificialintelligence, Jan-26-2019, 16:05:54 GMT

r/MachineLearning - [1901.08162] Causal Reasoning from Meta-reinforcement Learning

Abstract: Discovering and exploiting the causal structure in the environment is a crucial challenge for intelligent agents. Here we explore whether causal reasoning can emerge via meta-reinforcement learning. We train a recurrent network with model-free reinforcement learning to solve a range of problems that each contain causal structure. We find that the trained agent can perform causal reasoning in novel situations in order to obtain rewards. The agent can select informative interventions, draw causal inferences from observational data, and make counterfactual predictions.

artificial intelligence, machine learning, meta-reinforcement learning, (4 more...)

#artificialintelligence

Industry: Media > News (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Model-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

#artificialintelligence, Jan-26-2019, 01:34:56 GMT

r/MachineLearning - [D] Deepening your theoretical knowledge of DL, ML, and RL

Are you guys interested in making your theoretical foundations of Deep Learning, Machine Learning, and Reinforcement Learning strong enough that you can do ML with confidence? I have compiled a list of awesome lectures from 2012 to date, and the list is continuously growing. Please find the courses in my GitHub repo Deep Learning Drizzle. You're welcome to share it with anyone who might be curious to know these techniques in depth. Feel free to star or fork it, and please send a PR if you have suggestions!

artificial intelligence, machine learning, reinforcement learning, (5 more...)

#artificialintelligence

Industry: Media > News (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.33)