AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization

Wai, Hoi-To, Yang, Zhuoran, Wang, Zhaoran, Hong, Mingyi

arXiv.org Machine LearningJun-3-2018

Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Machine Learning

1806.00877

Country: North America > United States (0.28)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry: Energy > Power Industry (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Exploration in Structured Reinforcement Learning

Ok, Jungseul, Proutiere, Alexandre, Tranos, Damianos

arXiv.org Machine LearningJun-3-2018

We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward function are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes $S$ and $A$ of the state and action spaces, i.e., they are smaller than $c \log T$ where $T$ is the time horizon and the constant $c$ only depends on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as $SA \log T$ . We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1806.00775

Genre: Research Report (0.64)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Internal Model from Observations for Reward Shaping

Kimura, Daiki, Chaudhury, Subhajit, Tachibana, Ryuki, Dasgupta, Sakyasingha

arXiv.org Machine LearningJun-2-2018

Reinforcement learning methods require careful design involving a reward function to obtain the desired action policy for a given task. In the absence of hand-crafted reward functions, prior work on the topic has proposed several methods for reward estimation by using expert state trajectories and action pairs. However, there are cases where complete or good action information cannot be obtained from expert demonstrations. We propose a novel reinforcement learning method in which the agent learns an internal model of observation on the basis of expert-demonstrated state trajectories to estimate rewards without completely learning the dynamics of the external environment from state-action pairs. The internal model is obtained in the form of a predictive model for the given expert state distribution. During reinforcement learning, the agent predicts the reward as a function of the difference between the actual state and the state predicted by the internal model. We conducted multiple experiments in environments of varying complexity, including the Super Mario Bros and Flappy Bird games. We show our method successfully trains good policies directly from expert game-play videos.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

1806.01267

Country:

North America (0.46)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.15)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DAQN: Deep Auto-encoder and Q-Network

Kimura, Daiki

arXiv.org Machine LearningJun-2-2018

The deep reinforcement learning method usually requires a large number of training images and executing actions to obtain sufficient results. When it is extended a real-task in the real environment with an actual robot, the method will be required more training images due to complexities or noises of the input images, and executing a lot of actions on the real robot also becomes a serious problem. Therefore, we propose an extended deep reinforcement learning method that is applied a generative model to initialize the network for reducing the number of training trials. In this paper, we used a deep q-network method as the deep reinforcement learning method and a deep auto-encoder as the generative model. We conducted experiments on three different tasks: a cart-pole game, an atari game, and a real-game with an actual robot. The proposed method trained efficiently on all tasks than the previous method, especially 2.5 times faster on a task with real environment images.

machine learning, reinforcement, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1806.0063

Country: North America (0.28)

Genre: Research Report (0.83)

Industry: Leisure & Entertainment > Games > Computer Games (0.90)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Efficient Entropy for Policy Gradient with Multidimensional Action Space

Zhang, Yiming, Vuong, Quan Ho, Song, Kenny, Gong, Xiao-Yue, Ross, Keith W.

arXiv.org Machine LearningJun-2-2018

In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces such as in the Atari games. Many reinforcement learning problems, however, involve high-dimensional discrete action spaces as well as high-dimensional state spaces. This paper considers entropy bonus, which is used to encourage exploration in policy gradient. In the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results show that our entropy estimators substantially improve performance with marginal additional computational cost.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

1806.00589

Country: North America > United States > Massachusetts (0.28)

Genre: Research Report (0.70)

Industry: Leisure & Entertainment > Games > Computer Games (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.79)

Add feedback

TDM: From model-free to model-based deep reinforcement learning

RobohubJun-1-2018, 01:45:09 GMT

By Vitchyr Pong You've decided that you want to bike from your house by UC Berkeley to the Golden Gate Bridge. It's a nice 20 mile ride, but there's a problem: you've never ridden a bike before!To make matters worse, you are new to the Bay Area, and all you have is a good ol' fashion map to guide you. How do you get started? Let's first figure out how to ride a bike. One strategy would be to do a lot of studying and planning.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Robohub

Country: Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Deep Curiosity Search: Intra-Life Exploration Improves Performance on Challenging Deep Reinforcement Learning Problems

Stanton, Christopher, Clune, Jeff

arXiv.org Artificial IntelligenceJun-1-2018

Traditional exploration methods in RL require agents to perform random actions to find rewards. But these approaches struggle on sparse-reward domains like Montezuma's Revenge where the probability that any random action sequence leads to reward is extremely low. Recent algorithms have performed well on such tasks by encouraging agents to visit new states or perform new actions in relation to all prior training episodes (which we call across-training novelty). But such algorithms do not consider whether an agent exhibits intra-life novelty: doing something new within the current episode, regardless of whether those behaviors have been performed in previous episodes. We hypothesize that across-training novelty might discourage agents from revisiting initially non-rewarding states that could become important stepping stones later in training. We introduce Deep Curiosity Search (DeepCS), which encourages intra-life exploration by rewarding agents for visiting as many different states as possible within each episode, and show that DeepCS matches the performance of current state-of-the-art methods on Montezuma's Revenge. We further show that DeepCS improves exploration on Gravitar (another difficult, sparse-reward game) and performs well on the dense-reward game Amidar. Surprisingly, DeepCS doubles A2C performance on Seaquest, a game we would not have expected to benefit from intra-life exploration because the arena is small and already easily navigated by naive exploration techniques. In one run, DeepCS achieves a maximum training score of 80,000 points on Seaquest, higher than any methods other than Ape-X. The strong performance of DeepCS on these sparse- and dense-reward tasks suggests that encouraging intra-life novelty is an interesting, new approach for improving performance in Deep RL and motivates further research into hybridizing across-training and intra-life exploration methods.

agent, deepcs, exploration, (14 more...)

arXiv.org Artificial Intelligence

1806.00553

Country: North America > United States > Wyoming > Albany County > Laramie (0.14)

Genre: Research Report > Promising Solution (0.48)

Industry:

Leisure & Entertainment > Games (1.00)
Education > Focused Education > Special Education (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Equivalence Between Wasserstein and Value-Aware Model-based Reinforcement Learning

Asadi, Kavosh, Cater, Evan, Misra, Dipendra, Littman, Michael L.

arXiv.org Machine LearningJun-1-2018

Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. Recently Farahmand et al. (2017) proposed a value-aware (VAML) objective that captures the structure of value function during model learning. Using tools from Lipschitz continuity, we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric.

artificial intelligence, machine learning, reinforcement learning, (12 more...)

arXiv.org Machine Learning

1806.01265

Country:

North America > Canada > Quebec (0.15)
North America > United States > Texas (0.14)
Europe > United Kingdom > Scotland (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.48)

Add feedback

Integrating Episodic Memory into a Reinforcement Learning Agent using Reservoir Sampling

Young, Kenny J., Sutton, Richard S., Yang, Shuo

arXiv.org Machine LearningJun-1-2018

Episodic memory is a psychology term which refers to the ability to recall specific events from the past. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep learning systems, we propose a novel algorithm which uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online to preferentially remember those states which are found to be useful to recall later on. Critically this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory. Thus unlike most prior mechanisms for external memory it is feasible to use in an online reinforcement learning setting. Much of reinforcement learning (RL) theory is based on the assumption that the environment has the Markov property, meaning that future states are independent of past states given the present state. This implies the agent has all the information it needs to make an optimal decision at each time and therefore has no need to remember the past. This is however not realistic in general, realistic problems often require significant information from the past to make an informed decision in the present, and there is often no obvious way to incorporate the relevant information into an expanded present state.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

1806.0054

Country: North America > Canada > Alberta (0.28)

Genre: Research Report (0.50)

Industry: Health & Medicine > Consumer Health (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

Add feedback

A Reinforcement Learning Approach to Age of Information in Multi-User Networks

Ceran, Elif Tuğçe, Gündüz, Deniz, György, András

arXiv.org Machine LearningJun-1-2018

We consider a source node that communicates the most up-to-date status packets to multiple users (see Figure 1). We are interested in the average age of information (AoI) [1]-[3] at the users, for a system in which the source node samples an underlying timevarying process and schedules the transmission of the sample values over imperfect links. The AoI at each user at any point in time can simply be defined as the amount of time elapsed since the most recent status update at that user was generated. Most of the earlier work on AoI consider queue-based models, in which the status updates arrive at the source node randomly following a memoryless Poisson process, and are stored in a buffer before being transmitted to the destination [2], [3]. Instead, in the so-called generate-at-will model [1], [4]-[7], also considered in this paper, the status updates of the underlying process of interest can be generated at any time by the source node. AoI in multi-user networks has been studied in [6]- [11]. It is shown in [8] that the scheduling problem for the age minimization is NPhard in general. Scheduling transmissions to multiple receivers is investigated in [7], focusing on a perfect transmission medium, and the optimal scheduling algorithm is shown to be threshold-type. Average AoI has also been studied when status updates over unreliable multi-access channels [10] and multi-cast networks [11] are considered.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

1806.00336

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback