AITopics

We introduce a new algorithm for reinforcement learning called Maximum aposteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective. We show that several existing methods can directly be related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings while achieving similar or better final performance.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

1806.0692

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Mexico > Quintana Roo > Cancún (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Qualitative Measurements of Policy Discrepancy for Return-based Deep Q-Network

Meng, Wenjia, Zheng, Qian, Yang, Long, Li, Pengfei, Pan, Gang

In this paper, we focus on policy discrepancy in return-based deep Q-network (R-DQN) learning. We propose a general framework for R-DQN, with which most of the return-based reinforcement learning algorithms can be combined with DQN. We show the performance of traditional DQN can be significantly improved by introducing returnbased reinforcement learning. In order to further improve the performance of R-DQN, we present a strategy with two measurements which can qualitatively measure the policy discrepancy. Moreover, we give two bounds for these two measurements under the R-DQN framework. Algorithms with our strategy can accurately express the trace coefficient and achieve a better approximation to return. The experiments are carried out on several representative tasks from the OpenAI Gym library. Results show the algorithms with our strategy outperform the state-of-the-art R-DQN methods.

artificial intelligence, machine learning, reinforcement learning, (3 more...)

1806.06953

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)

Azizzadenesheli, Kamyar, Yang, Brandon, Liu, Weitang, Brunskill, Emma, Lipton, Zachary C, Anandkumar, Animashree

Sample-Efficient Deep RL with Generative Adversarial Tree Search

We propose Generative Adversarial Tree Search (GATS), a sample-efficient Deep Reinforcement Learning (DRL) algorithm. While Monte Carlo Tree Search (MCTS) is known to be effective for search and planning in RL, it is often sample-inefficient and therefore expensive to apply in practice. In this work, we develop a Generative Adversarial Network (GAN) architecture to model an environment's dynamics and a predictor model for the reward function. We exploit collected data from interaction with the environment to learn these models, which we then use for model-based planning. During planning, we deploy a finite depth MCTS, using the learned model for tree search and a learned Q-value for the leaves, to find the best action. We theoretically show that GATS improves the bias-variance trade-off in value-based DRL. Moreover, we show that the generative model learns the model dynamics using orders of magnitude fewer samples than the Q-learner. In non-stationary settings where the environment model changes, we find the generative model adapts significantly faster than the Q-learner to the new environment.

gdm, machine learning, reinforcement learning, (10 more...)

1806.0578

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games > Computer Games (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)

Glavin, Frank G., Madden, Michael G.

Adaptive Shooting for Bots in First Person Shooter Games Using Reinforcement Learning

In current state-of-the-art commercial first person shooter games, computer controlled bots, also known as non player characters, can often be easily distinguishable from those controlled by humans. Tell-tale signs such as failed navigation, "sixth sense" knowledge of human players' whereabouts and deterministic, scripted behaviors are some of the causes of this. We propose, however, that one of the biggest indicators of non humanlike behavior in these games can be found in the weapon shooting capability of the bot. Consistently perfect accuracy and "locking on" to opponents in their visual field from any distance are indicative capabilities of bots that are not found in human players. Traditionally, the bot is handicapped in some way with either a timed reaction delay or a random perturbation to its aim, which doesn't adapt or improve its technique over time. We hypothesize that enabling the bot to learn the skill of shooting through trial and error, in the same way a human player learns, will lead to greater variation in game-play and produce less predictable non player characters. This paper describes a reinforcement learning shooting mechanism for adapting shooting over time based on a dynamic reward signal from the amount of damage caused to opponents.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

1806.05554

Country:

Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
North America > United States > New York (0.04)
Europe > Ireland > Connaught > County Galway > Galway (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Self-Imitation Learning

Oh, Junhyuk, Guo, Yijie, Singh, Satinder, Lee, Honglak

This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent's past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard exploration Atari games and is competitive to the state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

1806.05635

Country:

North America > United States > Michigan (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Leisure & Entertainment > Games > Computer Games (0.57)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine LearningJun-13-2018

Marginal Policy Gradients for Complex Control

Eisenach, Carson, Yang, Haichuan, Liu, Ji, Liu, Han

Many complex domains, such as robotics control and real-time strategy (RTS) games, require an agent to learn a continuous control. In the former, an agent learns a policy over $\mathbb{R}^d$ and in the latter, over a discrete set of actions each of which is parametrized by a continuous parameter. Such problems are naturally solved using policy based reinforcement learning (RL) methods, but unfortunately these often suffer from high variance leading to instability and slow convergence. We show that in many cases a substantial portion of the variance in policy gradient estimators is completely unnecessary and can be eliminated without introducing bias. Unnecessary variance is introduced whenever policies over bounded action spaces are modeled using distributions with unbounded support, by applying a transformation $T$ to the sampled action before execution in the environment. Recent works have studied variance reduced policy gradients for actions in bounded intervals, but to date no variance reduced methods exist when the action is a direction -- constrained to the unit sphere -- something often seen in RTS games. To address these challenges we: (1) introduce a stochastic policy gradient method for directional control; (2) introduce the marginal policy gradient framework, a powerful technique to obtain variance reduced policy gradients for arbitrary $T$; (3) show that marginal policy gradients are guaranteed to reduce variance, quantifying that reduction exactly; (4) validate our framework by applying the methods to a popular RTS game and a navigation task, demonstrating improvement over a policy gradient baseline.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

1806.05134

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)

Schultz, Laura, Sokolov, Vadim

Deep Reinforcement Learning for Dynamic Urban Transportation Problems

arXiv.org Machine LearningJun-13-2018

We explore the use of deep learning and deep reinforcement learning for optimization problems in transportation. Many transportation system analysis tasks are formulated as an optimization problem - such as optimal control problems in intelligent transportation systems and long term urban planning. Often transportation models used to represent dynamics of a transportation system involve large data sets with complex input-output interactions and are difficult to use in the context of optimization. Use of deep learning metamodels can produce a lower dimensional representation of those relations and allow to implement optimization and reinforcement learning algorithms in an efficient manner. In particular, we develop deep learning models for calibrating transportation simulators and for reinforcement learning to solve the problem of optimal scheduling of travelers on the network.

machine learning, reinforcement, reinforcement learning, (14 more...)

arXiv.org Machine Learning

1806.0531

Country:

North America > United States > Virginia > Fairfax County > Fairfax (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)
(2 more...)

Genre: Research Report (0.40)

Industry: Transportation > Infrastructure & Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Glavin, Frank G., Madden, Michael G.

Learning to Shoot in First Person Shooter Games by Stabilizing Actions and Clustering Rewards for Reinforcement Learning

arXiv.org Artificial IntelligenceJun-13-2018

While reinforcement learning (RL) has been applied to turn-based board games for many years, more complex games involving decision-making in real-time are beginning to receive more attention. A challenge in such environments is that the time that elapses between deciding to take an action and receiving a reward based on its outcome can be longer than the interval between successive decisions. We explore this in the context of a non-player character (NPC) in a modern first-person shooter game. Such games take place in 3D environments where players, both human and computer-controlled, compete by engaging in combat and completing task objectives. We investigate the use of RL to enable NPCs to gather experience from game-play and improve their shooting skill over time from a reward signal based on the damage caused to opponents. We propose a new method for RL updates and reward calculations, in which the updates are carried out periodically, after each shooting encounter has ended, and a new weighted-reward mechanism is used which increases the reward applied to actions that lead to damaging the opponent in successive hits in what we term "hit clusters".

artificial intelligence, machine learning, reinforcement learning, (17 more...)

1806.05117

Country: Europe > Ireland > Connaught > County Galway > Galway (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Panov, Aleksandr I., Skrynnik, Aleksey

Automatic formation of the structure of abstract machines in hierarchical reinforcement learning with state clustering

arXiv.org Artificial IntelligenceJun-13-2018

We introduce a new approach to hierarchy formation and task decomposition in hierarchical reinforcement learning. Our method is based on the Hierarchy Of Abstract Machines (HAM) framework because HAM approach is able to design efficient controllers that will realize specific behaviors in real robots. The key to our algorithm is the introduction of the internal or "mental" environment in which the state represents the structure of the HAM hierarchy. The internal action in this environment leads to changes the hierarchy of HAMs. We propose the classical Q-learning procedure in the internal environment which allows the agent to obtain an optimal hierarchy. We extends the HAM framework by adding on-model approach to select the appropriate sub-machine to execute action sequences for certain class of external environment states. Preliminary experiments demonstrated the prospects of the method.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

1806.05292

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Russia (0.04)

Genre:

Workflow (0.67)
Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Demirel, Burak, Ramaswamy, Arunselvan, Quevedo, Daniel E., Karl, Holger

DeepCAS: A Deep Reinforcement Learning Algorithm for Control-Aware Scheduling

arXiv.org Artificial IntelligenceJun-13-2018

We consider networked control systems consisting of multiple independent controlled subsystems, operating over a shared communication network. Such systems are ubiquitous in cyber-physical systems, Internet of Things, and large-scale industrial systems. In many large-scale settings, the size of the communication network is smaller than the size of the system. In consequence, scheduling issues arise. The main contribution of this paper is to develop a deep reinforcement learning-based \emph{control-aware} scheduling (\textsc{DeepCAS}) algorithm to tackle these issues. We use the following (optimal) design strategy: First, we synthesize an optimal controller for each subsystem; next, we design a learning algorithm that adapts to the chosen subsystems (plants) and controllers. As a consequence of this adaptation, our algorithm finds a schedule that minimizes the \emph{control loss}. We present empirical results to show that \textsc{DeepCAS} finds schedules with better performance than periodic ones.

machine learning, reinforcement learning, subsystem, (18 more...)

1803.02998

Country: Europe > Germany (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Smart Houses & Appliances (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)