 Reinforcement Learning



What's New in Deep Learning Research: Understanding DeepMind's IMPALA

#artificialintelligence

Deep reinforcement learning has rapidly become one of the hottest research areas in the deep learning ecosystem. Part of the fascination with reinforcement learning is that, of all the deep learning modalities, it is the one that most closely resembles how humans learn. In the last few years, no company in the world has done more to advance the state of deep reinforcement learning than Alphabet's subsidiary DeepMind. Since the launch of its famous AlphaGo agent, DeepMind has been at the forefront of reinforcement learning research. A few days ago, the company published new research that attempts to tackle one of the most challenging aspects of reinforcement learning solutions: multi-tasking. From infancy, multi-tasking is an intrinsic element of our cognition.


Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

arXiv.org Machine Learning

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem, showing that the local variance of the bias function appears in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm establishing a high-probability regret bound scaling as $\widetilde {\mathcal O}\Bigl({\textstyle \sqrt{S\sum_{s,a}{\bf V}^\star_{s,a}T}}\Bigr)$ for this algorithm for ergodic MDPs, where $S$ denotes the number of states and where ${\bf V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best previously known regret bound $\widetilde {\mathcal O}(DS\sqrt{AT})$ for that algorithm, where $A$ and $D$ respectively denote the maximum number of actions (per state) and the diameter of the MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs, indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, which we believe to be of independent interest.
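To make the comparison of leading terms concrete, here is a minimal sketch that evaluates both expressions with logarithmic factors and constants dropped. The function names and the toy MDP numbers are assumptions chosen for illustration; they are not taken from the paper's benchmarks.

```python
import math

def variance_aware_term(num_states, bias_variances, horizon):
    """Leading term sqrt(S * sum_{s,a} V*_{s,a} * T), log factors and constants dropped."""
    return math.sqrt(num_states * sum(bias_variances) * horizon)

def diameter_term(diameter, num_states, num_actions, horizon):
    """Leading term D * S * sqrt(A * T), log factors and constants dropped."""
    return diameter * num_states * math.sqrt(num_actions * horizon)

# Hypothetical MDP (illustrative numbers only): 10 states, 2 actions per state,
# diameter 20, and every variance V*_{s,a} equal to 1.
S, A, D, T = 10, 2, 20.0, 10**6
variances = [1.0] * (S * A)

new_term = variance_aware_term(S, variances, T)
old_term = diameter_term(D, S, A, T)
ratio = old_term / new_term  # how much larger the diameter-based term is
print(f"variance-aware: {new_term:.1f}, diameter-based: {old_term:.1f}, ratio: {ratio:.1f}")
```

In this toy setting every variance equals 1, so the variance-aware term reduces to $S\sqrt{AT}$ and the ratio of the two terms is exactly the diameter $D$.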


Q-Learning Algorithm for VoLTE Closed-Loop Power Control in Indoor Small Cells

arXiv.org Machine Learning

We propose a closed-loop power control algorithm for the downlink of the voice over LTE (VoLTE) radio bearer for an indoor environment served by small cells. The main contributions of our paper are: 1) proposing closed-loop power control for downlink VoLTE (or any packetized voice bearer), 2) deriving an upper bound of the loss in VoLTE downlink signal to interference plus noise ratio which the closed-loop power control has to overcome, 3) employing reinforcement learning to perform closed-loop power control, and 4) showing that this closed-loop power control method can improve the quality of VoLTE in a realistic network setup. Our simulation results show that the proposed algorithm significantly improves both voice retainability and mean opinion score by maintaining the effective downlink signal to interference plus noise ratio against adverse network operational issues and faults.
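The abstract does not specify the state, action, or reward design, so the following is only a generic tabular Q-learning sketch of how a power control loop could learn from feedback. The state labels, the three power steps, and the reward value are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical action set: lower power by 1 dB, hold, or raise by 1 dB.
ACTIONS = (-1, 0, 1)

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])

def choose_action(q, state, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

# One illustrative step: raising power in a low-SINR state is rewarded.
q_table = defaultdict(float)
q_update(q_table, state="sinr_low", action=1, reward=1.0, next_state="sinr_ok")
greedy_action = choose_action(q_table, "sinr_low", epsilon=0.0)
```

With a table initialised to zero, the single update above raises the value of the "+1 dB" action in the low-SINR state, so the greedy policy then prefers it.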


Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games?

arXiv.org Machine Learning

Deep reinforcement learning has achieved many recent successes, but our understanding of its strengths and limitations is hampered by the lack of rich environments in which we can fully characterize optimal behavior, and correspondingly diagnose individual actions against such a characterization. Here we consider a family of combinatorial games, arising from work of Erdos, Selfridge, and Spencer, and we propose their use as environments for evaluating and comparing different approaches to reinforcement learning. These games have a number of appealing features: they are challenging for current learning approaches, but they form (i) a low-dimensional, simply parametrized environment where (ii) there is a linear closed form solution for optimal behavior from any state, and (iii) the difficulty of the game can be tuned by changing environment parameters in an interpretable way. We use these Erdos-Selfridge-Spencer games not only to compare different algorithms, but also to test for generalization, make comparisons to supervised learning, analyse multiagent play, and even develop a self play algorithm.
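The abstract does not spell out the closed form, so the sketch below assumes the classic attacker-defender formulation of the game: pieces sit on levels 0 to K, a piece at level l has weight 2^-(K - l), the attacker splits the pieces into two sets, and the defender destroys one of them. Under that assumption, a linear potential function gives the defender's choice. The function names and the list-based state encoding are illustrative.

```python
def potential(pieces_per_level, top_level):
    """Sum of 2^-(K - l) over all pieces, where index l of the list is the level."""
    return sum(count * 2.0 ** -(top_level - level)
               for level, count in enumerate(pieces_per_level))

def defender_choice(set_a, set_b, top_level):
    """Destroy the set with the larger potential; returns 'A' or 'B'."""
    if potential(set_a, top_level) >= potential(set_b, top_level):
        return "A"
    return "B"

# Hypothetical position with K = 3: set A has one piece at level 2,
# set B has two pieces at level 0.
K = 3
set_a = [0, 0, 1, 0]
set_b = [2, 0, 0, 0]
choice = defender_choice(set_a, set_b, K)
```

Because the potential is linear in the piece counts, an agent's individual moves can be checked against it directly, which is the kind of per-action diagnosis the abstract describes.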


Recurrent Predictive State Policy Networks

arXiv.org Machine Learning

We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions, to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz and Gordon, 2004; Sun et al., 2016) by modeling predictive state: a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2018) to initialize the recursive filter. Predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behaviour. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks under partial observability on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks perform well compared with memory-preserving networks such as GRUs, as well as finite memory models, and are the best-performing method overall.


Learning From Scratch by Thinking Fast and Slow with Deep Learning and Tree Search

#artificialintelligence

According to dual process theory, human reasoning consists of two different kinds of thinking. System 1 is a fast, unconscious and automatic mode of thought, also known as intuition. System 2 is a slow, conscious, explicit and rule-based mode of reasoning that is believed to be an evolutionarily recent process. When learning to complete a challenging planning task, such as playing a board game, humans exploit both processes: strong intuitions allow for more effective analytic reasoning by rapidly selecting interesting lines of play for consideration. Repeated deep study gradually improves intuitions.


New algorithm lets AI learn from mistakes, become a little more human

#artificialintelligence

In recent months, researchers at OpenAI have been focusing on developing artificial intelligence (AI) that learns better. Their machine learning algorithms are now capable of training themselves, so to speak, thanks to the reinforcement learning methods of their OpenAI Baselines. Now, a new algorithm lets their AI learn from its own mistakes, almost as human beings do. The development comes from a new open-source algorithm called Hindsight Experience Replay (HER), which OpenAI researchers released earlier this week. As its name suggests, HER helps an AI agent "look back" in hindsight, so to speak, as it completes a task.
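The "look back" step can be illustrated with a small relabeling sketch. It assumes a goal-conditioned setting with a sparse reward and relabels each transition with the state the episode actually ended in, which is one commonly described variant of hindsight relabeling. The `Transition` type, the reward convention, and the function names are assumptions for illustration, not the OpenAI Baselines API.

```python
from typing import NamedTuple

class Transition(NamedTuple):
    state: tuple
    action: int
    next_state: tuple
    goal: tuple
    reward: float

def sparse_reward(achieved, goal):
    """Hypothetical sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == goal else -1.0

def relabel_with_final_state(episode):
    """Pretend the state the episode ended in was the goal all along."""
    new_goal = episode[-1].next_state
    return [t._replace(goal=new_goal, reward=sparse_reward(t.next_state, new_goal))
            for t in episode]

# A failed episode: the agent aimed for (5, 5) but only reached (1, 1).
episode = [
    Transition((0, 0), 0, (1, 0), (5, 5), -1.0),
    Transition((1, 0), 1, (1, 1), (5, 5), -1.0),
]
hindsight = relabel_with_final_state(episode)
```

The original episode contains only failure rewards, while the relabeled copy ends with a success signal for the goal (1, 1), so the same experience now carries something to learn from.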


Consequentialist conditional cooperation in social dilemmas with imperfect information

arXiv.org Artificial Intelligence

Social dilemmas, where mutual cooperation can lead to high payoffs but participants face incentives to cheat, are ubiquitous in multi-agent interaction. We wish to construct agents that cooperate with pure cooperators, avoid exploitation by pure defectors, and incentivize cooperation from the rest. However, often the actions taken by a partner are (partially) unobserved or the consequences of individual actions are hard to predict. We show that in a large class of games good strategies can be constructed by conditioning one's behavior solely on outcomes (i.e., one's past rewards). We call this consequentialist conditional cooperation. We show how to construct such strategies using deep reinforcement learning techniques and demonstrate, both analytically and experimentally, that they are effective in social dilemmas beyond simple matrix games. We also show the limitations of relying purely on consequences and discuss the need for understanding both the consequences of and the intentions behind an action.
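As a rough illustration of conditioning behavior only on one's own past rewards, the sketch below cooperates while the recent average payoff stays at or above a threshold and defects otherwise. The threshold rule, the window size, and the function name are assumptions made for this example; the paper's actual construction uses deep reinforcement learning and may differ.

```python
def ccc_action(past_rewards, threshold, window=10):
    """Cooperate while the recent average reward meets the threshold, else defect."""
    recent = past_rewards[-window:]
    if not recent:
        # No evidence yet: start by cooperating.
        return "cooperate"
    average = sum(recent) / len(recent)
    return "cooperate" if average >= threshold else "defect"

# Hypothetical payoffs: 3 per round under mutual cooperation, 0 when exploited.
first_move = ccc_action([], threshold=2.0)
with_cooperator = ccc_action([3, 3, 3], threshold=2.0)
with_defector = ccc_action([3, 0, 0, 0], threshold=2.0)
```

The rule never inspects the partner's actions, only the agent's own payoffs, which is what makes it usable when those actions are partially unobserved.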


AI finds novel way to beat classic Q*bert Atari video game

BBC News

Atari video game Q*bert has been beaten by an Artificial Intelligence program, which exploited a loophole that had never previously been discovered. The AI program used trial and error to uncover a quirk in the game's code that let it score a huge number of points. No human player of Q*bert is believed to have ever uncovered the tricks it used to win. The AI program was let loose on the video game by German researchers who are developing code that can learn. Video games have proved popular with AI researchers because they are limited worlds in which success (high scores) and failure (losing the game) are easy to assess.