Reinforcement Learning
Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization
Wai, Hoi-To, Yang, Zhuoran, Wang, Zhaoran, Hong, Mingyi
Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.
Exploration in Structured Reinforcement Learning
Ok, Jungseul, Proutiere, Alexandre, Tranos, Damianos
We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward function are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes $S$ and $A$ of the state and action spaces, i.e., they are smaller than $c \log T$ where $T$ is the time horizon and the constant $c$ only depends on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as $SA \log T$ . We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure.
Internal Model from Observations for Reward Shaping
Kimura, Daiki, Chaudhury, Subhajit, Tachibana, Ryuki, Dasgupta, Sakyasingha
Reinforcement learning methods require careful design involving a reward function to obtain the desired action policy for a given task. In the absence of hand-crafted reward functions, prior work on the topic has proposed several methods for reward estimation by using expert state trajectories and action pairs. However, there are cases where complete or good action information cannot be obtained from expert demonstrations. We propose a novel reinforcement learning method in which the agent learns an internal model of observation on the basis of expert-demonstrated state trajectories to estimate rewards without completely learning the dynamics of the external environment from state-action pairs. The internal model is obtained in the form of a predictive model for the given expert state distribution. During reinforcement learning, the agent predicts the reward as a function of the difference between the actual state and the state predicted by the internal model. We conducted multiple experiments in environments of varying complexity, including the Super Mario Bros and Flappy Bird games. We show our method successfully trains good policies directly from expert game-play videos.
DAQN: Deep Auto-encoder and Q-Network
The deep reinforcement learning method usually requires a large number of training images and executing actions to obtain sufficient results. When it is extended a real-task in the real environment with an actual robot, the method will be required more training images due to complexities or noises of the input images, and executing a lot of actions on the real robot also becomes a serious problem. Therefore, we propose an extended deep reinforcement learning method that is applied a generative model to initialize the network for reducing the number of training trials. In this paper, we used a deep q-network method as the deep reinforcement learning method and a deep auto-encoder as the generative model. We conducted experiments on three different tasks: a cart-pole game, an atari game, and a real-game with an actual robot. The proposed method trained efficiently on all tasks than the previous method, especially 2.5 times faster on a task with real environment images.
Efficient Entropy for Policy Gradient with Multidimensional Action Space
Zhang, Yiming, Vuong, Quan Ho, Song, Kenny, Gong, Xiao-Yue, Ross, Keith W.
In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces such as in the Atari games. Many reinforcement learning problems, however, involve high-dimensional discrete action spaces as well as high-dimensional state spaces. This paper considers entropy bonus, which is used to encourage exploration in policy gradient. In the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results show that our entropy estimators substantially improve performance with marginal additional computational cost.
TDM: From model-free to model-based deep reinforcement learning
By Vitchyr Pong You've decided that you want to bike from your house by UC Berkeley to the Golden Gate Bridge. It's a nice 20 mile ride, but there's a problem: you've never ridden a bike before!To make matters worse, you are new to the Bay Area, and all you have is a good ol' fashion map to guide you. How do you get started? Let's first figure out how to ride a bike. One strategy would be to do a lot of studying and planning.
Deep Curiosity Search: Intra-Life Exploration Improves Performance on Challenging Deep Reinforcement Learning Problems
Stanton, Christopher, Clune, Jeff
Traditional exploration methods in RL require agents to perform random actions to find rewards. But these approaches struggle on sparse-reward domains like Montezuma's Revenge where the probability that any random action sequence leads to reward is extremely low. Recent algorithms have performed well on such tasks by encouraging agents to visit new states or perform new actions in relation to all prior training episodes (which we call across-training novelty). But such algorithms do not consider whether an agent exhibits intra-life novelty: doing something new within the current episode, regardless of whether those behaviors have been performed in previous episodes. We hypothesize that across-training novelty might discourage agents from revisiting initially non-rewarding states that could become important stepping stones later in training. We introduce Deep Curiosity Search (DeepCS), which encourages intra-life exploration by rewarding agents for visiting as many different states as possible within each episode, and show that DeepCS matches the performance of current state-of-the-art methods on Montezuma's Revenge. We further show that DeepCS improves exploration on Gravitar (another difficult, sparse-reward game) and performs well on the dense-reward game Amidar. Surprisingly, DeepCS doubles A2C performance on Seaquest, a game we would not have expected to benefit from intra-life exploration because the arena is small and already easily navigated by naive exploration techniques. In one run, DeepCS achieves a maximum training score of 80,000 points on Seaquest, higher than any methods other than Ape-X. The strong performance of DeepCS on these sparse- and dense-reward tasks suggests that encouraging intra-life novelty is an interesting, new approach for improving performance in Deep RL and motivates further research into hybridizing across-training and intra-life exploration methods.
Equivalence Between Wasserstein and Value-Aware Model-based Reinforcement Learning
Asadi, Kavosh, Cater, Evan, Misra, Dipendra, Littman, Michael L.
Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. Recently Farahmand et al. (2017) proposed a value-aware (VAML) objective that captures the structure of value function during model learning. Using tools from Lipschitz continuity, we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric.
Integrating Episodic Memory into a Reinforcement Learning Agent using Reservoir Sampling
Young, Kenny J., Sutton, Richard S., Yang, Shuo
Episodic memory is a psychology term which refers to the ability to recall specific events from the past. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep learning systems, we propose a novel algorithm which uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online to preferentially remember those states which are found to be useful to recall later on. Critically this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory. Thus unlike most prior mechanisms for external memory it is feasible to use in an online reinforcement learning setting. Much of reinforcement learning (RL) theory is based on the assumption that the environment has the Markov property, meaning that future states are independent of past states given the present state. This implies the agent has all the information it needs to make an optimal decision at each time and therefore has no need to remember the past. This is however not realistic in general, realistic problems often require significant information from the past to make an informed decision in the present, and there is often no obvious way to incorporate the relevant information into an expanded present state.
A Reinforcement Learning Approach to Age of Information in Multi-User Networks
Ceran, Elif Tuğçe, Gündüz, Deniz, György, András
We consider a source node that communicates the most up-to-date status packets to multiple users (see Figure 1). We are interested in the average age of information (AoI) [1]-[3] at the users, for a system in which the source node samples an underlying timevarying process and schedules the transmission of the sample values over imperfect links. The AoI at each user at any point in time can simply be defined as the amount of time elapsed since the most recent status update at that user was generated. Most of the earlier work on AoI consider queue-based models, in which the status updates arrive at the source node randomly following a memoryless Poisson process, and are stored in a buffer before being transmitted to the destination [2], [3]. Instead, in the so-called generate-at-will model [1], [4]-[7], also considered in this paper, the status updates of the underlying process of interest can be generated at any time by the source node. AoI in multi-user networks has been studied in [6]- [11]. It is shown in [8] that the scheduling problem for the age minimization is NPhard in general. Scheduling transmissions to multiple receivers is investigated in [7], focusing on a perfect transmission medium, and the optimal scheduling algorithm is shown to be threshold-type. Average AoI has also been studied when status updates over unreliable multi-access channels [10] and multi-cast networks [11] are considered.