Reinforcement Learning
Private Q-Learning with Functional Noise in Continuous Spaces
We consider privacy-preserving algorithms for deep reinforcement learning. State-of-the-art methods that guarantee differential privacy are not extendable to very large state spaces because the noise level necessary to ensure privacy would scale to infinity. We address the problem of providing differential privacy in Q-learning where a function approximation through a neural network is used for parametrization. We develop a rigorous and efficient algorithm by inspecting the reproducing kernel Hilbert space in which the neural network is embedded. Our approach uses functional noise to guarantee privacy, while the noise level scales linearly with the complexity of the neural network architecture. There are no known theoretical guarantees on the performance of deep reinforcement learning, but we gain some insight by providing a utility analysis under the discrete space setting.
Trust Region-Guided Proximal Policy Optimization
Wang, Yuhui, He, Hao, Tan, Xiaoyang, Gan, Yaozhong
Model-free reinforcement learning relies heavily on a safe yet exploratory policy search. Proximal policy optimization (PPO) is a prominent algorithm to address the safe search problem, by exploiting a heuristic clipping mechanism motivated by a theoretically-justified "trust region" guidance. However, we found that the clipping mechanism of PPO could lead to a lack of exploration issue. Based on this finding, we improve the original PPO with an adaptive clipping mechanism guided by a "trust region" criterion. Our method, termed as Trust Region-Guided PPO (TRPPO), improves PPO with more exploration and better sample efficiency, while maintains the safe search property and design simplicity of PPO. On several benchmark tasks, TRPPO significantly outperforms the original PPO and is competitive with several state-of-the-art methods.
WALL-E: An Efficient Reinforcement Learning Research Framework
Xu, Tianbing, Zhang, Andrew, Zhao, Liang
Overall, reinforcement learning (RL) involves an agent interacting with an environment through repeatedly running a policy ฯ, collecting experience from each iteration and using that experience to update its policy for maximal reward (Fig 1). Figure 1: RL flow chart Thanks to advancements in big data, computing power, and other machine learning discipline, reinforcement learning has emerged as the pinnacle field in pushing humanity closer to true artificial intelligence.Model-based reinforcement learning, for example, aims to build an accurate model (such as a MDP) of the environment dynamics and train the agent on said model, giving model learning capabilities as well as ease of reward learning. On the other hand, in model-free reinforcement learning, the agent does not have explicit information regarding state transitions and must continuously explore and generate experience to find the optimal policy. In recent years, major problems have arisen in the field of reinforcement learning, such as planning and how to balance exploration and exploitation. Of particular interest, however, is the problem of knowledge gathering, namely how to efficiently and quickly sample trajectories to gain experience and update the policy without adversely affecting average return.
Modularization of End-to-End Learning: Case Study in Arcade Games
Melnik, Andrew, Fleer, Sascha, Schilling, Malte, Ritter, Helge
Complex environments and tasks pose a difficult problem for holistic end-to-end learning approaches. Decomposition of an environment into interacting controllable and non-controllable objects allows supervised learning for non-controllable objects and universal value function approximator learning for controllable objects. Such decomposition should lead to a shorter learning time and better generalisation capability. Here, we consider arcade-game environments as sets of interacting objects (controllable, non-controllable) and propose a set of functional modules that are specialized on mastering different types of interactions in a broad range of environments. The modules utilize regression, supervised learning, and reinforcement learning algorithms. Results of this case study in different Atari games suggest that human-level performance can be achieved by a learning agent within a human amount of game experience (10-15 minutes game time) when a proper decomposition of an environment or a task is provided. However, automatization of such decomposition remains a challenging problem. This case study shows how a model of a causal structure underlying an environment or a task can benefit learning time and generalization capability of the agent, and argues in favor of exploiting modular structure in contrast to using pure end-to-end learning approaches.
Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift
Gelada, Carles, Bellemare, Marc G.
In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.'s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.
Reward Shaping via Meta-Learning
Zou, Haosheng, Ren, Tongzheng, Yan, Dong, Su, Hang, Zhu, Jun
Reward shaping is one of the most effective methods to tackle the crucial yet challenging problem of credit assignment in Reinforcement Learning (RL). However, designing shaping functions usually requires much expert knowledge and hand-engineering, and the difficulties are further exacerbated given multiple similar tasks to solve. In this paper, we consider reward shaping on a distribution of tasks, and propose a general meta-learning framework to automatically learn the efficient reward shaping on newly sampled tasks, assuming only shared state space but not necessarily action space. We first derive the theoretically optimal reward shaping in terms of credit assignment in model-free RL. We then propose a value-based meta-learning algorithm to extract an effective prior over the optimal reward shaping. The prior can be applied directly to new tasks, or provably adapted to the task-posterior while solving the task within few gradient updates. We demonstrate the effectiveness of our shaping through significantly improved learning efficiency and interpretable visualizations across various settings, including notably a successful transfer from DQN to DDPG.
Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning
Qu, Chao, Mannor, Shie, Xu, Huan, Qi, Yuan, Song, Le, Xiong, Junwu
We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve the joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume that the reward function for each agent can be different and observed only locally by the agent itself. Furthermore, each agent is located at a node of a communication network and can exchanges information only with its neighbors. Using softmax temporal consistency and a decentralized optimization method, we obtain a principled and data-efficient iterative algorithm. In the first step of each iteration, an agent computes its local policy and value gradients and then updates only policy parameters. In the second step, the agent propagates to its neighbors the messages based on its value function and then updates its own value function. Hence we name the algorithm value propagation. We prove a non-asymptotic convergence rate 1/T with the nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with convergence guarantee in the control, off-policy and non-linear function approximation setting. We empirically demonstrate the effectiveness of our approach in experiments.
AI Helps Amputees Walk With a Robotic Knee Web Design & Website Hosting Services Creative Digital Agency Mean Web Host
A movie montage for modern artificial intelligence might show a computer playing millions of games of chess or Go against itself to learn how to win. Now, researchers are exploring how the reinforcement learning technique that helped DeepMind's AlphaZero conquer the chess and Go could tackle an even more complex task--training a robotic knee to help amputees walk smoothly. You must log in to article a comment. This site uses Akismet to reduce spam. Learn how your comment data is processed.
r/MachineLearning - [1901.08162] Causal Reasoning from Meta-reinforcement Learning
Abstract: Discovering and exploiting the causal structure in the environment is a crucial challenge for intelligent agents. Here we explore whether causal reasoning can emerge via meta-reinforcement learning. We train a recurrent network with model-free reinforcement learning to solve a range of problems that each contain causal structure. We find that the trained agent can perform causal reasoning in novel situations in order to obtain rewards. The agent can select informative interventions, draw causal inferences from observational data, and make counterfactual predictions.
r/MachineLearning - [D] Deepening your theoretical knowledge of DL, ML, and RL
Are you guys interested in making your theoretical foundations of Deep Learning, Machine Learning, and Reinforcement Learning strong enough so that you can do ML with confidence? I have compiled a list of awesome lectures starting from 2012-till date, and the list is continuously growing. Please find the courses in my GitHub repo Deep Learning Drizzle. You're welcome to share it with anyone who might be curious to know these techniques in depth. Feel free to star or fork it & also please send a PR if you have some suggestions!