 Reinforcement Learning


Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future

arXiv.org Machine Learning

In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.


Leveling the Playing Field - Fairness in AI Versus Human Game Benchmarks

arXiv.org Artificial Intelligence

From the beginning of the history of AI, there has been interest in games as a platform for research. As the field developed, human-level competence in complex games became a target researchers worked to reach. Only relatively recently has this target finally been met for traditional tabletop games such as Backgammon, Chess and Go. Current research focus has shifted to electronic games, which provide unique challenges. As is often the case with AI research, these results are liable to be exaggerated or misrepresented by either authors or third parties. The extent to which these game benchmarks constitute fair competition between human and AI is also a matter of debate. In this work, we review the statements made by authors and third parties in the general media and academic circles about these game benchmark results and discuss factors that can impact the perception of fairness in the contest between humans and machines.


Discovering Options for Exploration by Minimizing Cover Time

arXiv.org Artificial Intelligence

One of the main challenges in reinforcement learning is solving tasks with sparse reward. We show that the difficulty of discovering a distant rewarding state in an MDP is bounded by the expected cover time of a random walk over the graph induced by the MDP's transition dynamics. We therefore propose to accelerate exploration by constructing options that minimize cover time. The ...

Finding a set of edges that minimizes expected cover time is an extremely hard combinatorial optimization problem (Braess, 1968; Braess et al., 2005). Thus, our algorithm instead seeks to minimize the upper bound of the expected cover time given as a function of the algebraic connectivity of the graph Laplacian (Fiedler, 1973; Broder & Karlin, 1989; Chung, 1996) using the heuristic method by Ghosh & Boyd (2006) that improves the upper bound of the expected cover time of a uniform random walk.
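To make the quantity in this excerpt concrete, the sketch below computes the algebraic connectivity (the second-smallest eigenvalue of the graph Laplacian) of a small undirected graph, before and after adding one shortcut edge. It is only an illustration of that quantity, not the paper's option-discovery algorithm: the function names, the power-iteration approach, and the four-state chain are assumptions made for this example, and the graph is assumed to be simple (no repeated edges) and connected.

```python
import math


def laplacian(n, edges):
    """Return the graph Laplacian L = D - A of an undirected simple graph."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    return L


def _center_and_normalize(v):
    """Project out the constant vector, then scale to unit length."""
    mean = sum(v) / len(v)
    v = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def algebraic_connectivity(n, edges, iters=2000):
    """Estimate the second-smallest Laplacian eigenvalue by power iteration.

    Iterates with M = c*I - L, where c = 2 * max_degree bounds the largest
    eigenvalue of L, while repeatedly removing the constant eigenvector.
    """
    L = laplacian(n, edges)
    c = 2.0 * max(L[i][i] for i in range(n))
    v = [float(i) for i in range(n)]  # fixed, non-constant starting vector
    for _ in range(iters):
        v = _center_and_normalize(v)
        v = [c * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
    v = _center_and_normalize(v)
    Lv = [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * Lv[i] for i in range(n))  # Rayleigh quotient


# Hypothetical 4-state chain, then the same chain with a shortcut edge 0-3.
chain = [(0, 1), (1, 2), (2, 3)]
print(algebraic_connectivity(4, chain))             # about 0.586 (2 - sqrt(2))
print(algebraic_connectivity(4, chain + [(0, 3)]))  # about 2.0
```

In this toy case a single edge joining the two ends of the chain raises the algebraic connectivity from roughly 0.586 to 2, which is the direction of change the excerpt associates with a better upper bound on expected cover time.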


Online Antenna Tuning in Heterogeneous Cellular Networks with Deep Reinforcement Learning

arXiv.org Machine Learning

We aim to jointly optimize the antenna tilt angle, and the vertical and horizontal half-power beamwidths of the macrocells in a heterogeneous cellular network (HetNet) via a synergistic combination of deep learning (DL) and reinforcement learning (RL). The interactions between the cells, most notably due to their coupled interference and the large number of users, render this optimization problem prohibitively complex. This makes the proposed deep RL technique attractive as a practical online solution for real deployments, which should automatically adapt to new base stations being added and other environmental changes in the network. In the proposed algorithm, DL is used to extract the features by learning the locations of the users, and mean field RL is used to learn the average interference values for different antenna settings. Our results illustrate that the proposed deep RL algorithm can approach the optimum weighted sum rate with hundreds of online trials, as opposed to millions of trials for standard Q-learning, assuming relatively low environmental dynamics. Furthermore, the proposed algorithm is compact and implementable, and empirically appears to provide a performance guarantee regardless of the amount of environmental dynamics.


AI2-THOR: An Interactive 3D Environment for Visual AI

arXiv.org Artificial Intelligence

We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.


Atari-HEAD: Atari Human Eye-Tracking and Demonstration Dataset

arXiv.org Machine Learning

... and eye movements while playing Atari video games. The dataset currently has 44 hours of gameplay data from 16 games and a total of 2.97 million demonstrated actions. Human subjects played games in a frame-by-frame manner to allow enough decision time in order to obtain near-optimal decisions. This dataset could be potentially used for research in imitation learning, reinforcement learning, and ...

Additionally, previous research has shown that given a task context, human visual attention is modulated by reward [5, 9, 17]. In performing a familiar task, objects with high potential reward or penalty attract human attention; hence gaze indicates the momentary attentional priorities over multiple objects. Therefore the gaze could be a potentially useful intermediate learning signal for imitation learning.


A Review of Reinforcement Learning for Autonomous Building Energy Management

arXiv.org Machine Learning

The area of building energy management has received a significant amount of interest in recent years. This area is concerned with combining advancements in sensor technologies, communications and advanced control algorithms to optimize energy utilization. Reinforcement learning is one of the most prominent machine learning algorithms used for control problems and has had many successful applications in the area of building energy management. This research gives a comprehensive review of the literature relating to the application of reinforcement learning to developing autonomous building energy management systems. The main direction for future research and challenges in reinforcement learning are also outlined.


Successive Over Relaxation Q-Learning

arXiv.org Machine Learning

In a discounted reward Markov Decision Process (MDP) the objective is to find the optimal value function, i.e., the value function corresponding to an optimal policy. This problem reduces to solving a functional equation known as the Bellman equation, and a fixed point iteration scheme known as value iteration is utilized to obtain the solution. In [1], a successive over-relaxation based value iteration scheme is proposed to speed up the computation of the optimal value function. They propose a modified Bellman equation and prove faster convergence to the optimal value function. However, in many practical applications, the model information is not known and we resort to Reinforcement Learning (RL) algorithms to obtain an optimal policy and value function. One such popular algorithm is Q-Learning. In this paper, we propose Successive Over Relaxation (SOR) Q-Learning. We first derive a fixed point iteration for optimal Q-values based on [1] and utilize stochastic approximation to derive a learning algorithm to compute the optimal value function and an optimal policy. We then prove the convergence of the SOR Q-Learning to optimal Q-values. Finally, through numerical experiments, we show that SOR Q-Learning is faster than the standard Q-Learning algorithm.
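The abstract does not state the update rule, so the sketch below is an assumption about its general shape rather than the authors' algorithm: a tabular Q-learning step whose target blends the usual bootstrapped value with the current best value at the same state, weighted by a relaxation parameter `w`. The function name, the parameter names, and the toy Q-table are all hypothetical. Setting `w = 1` reduces the step to standard Q-learning, which makes the relationship between the two easy to check.

```python
def sor_q_update(Q, s, a, r, s_next, alpha, gamma, w):
    """Apply one over-relaxed tabular Q-learning step in place.

    Assumed form of the target (not given in the abstract):
        w * (r + gamma * max_b Q[s_next][b]) + (1 - w) * max_c Q[s][c]
    With w == 1 this is the standard Q-learning target.
    """
    bootstrap = r + gamma * max(Q[s_next])
    target = w * bootstrap + (1.0 - w) * max(Q[s])
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]


# Toy two-state, two-action table (hypothetical values).
Q_standard = [[0.0, 1.0], [2.0, 0.5]]
Q_relaxed = [[0.0, 1.0], [2.0, 0.5]]

# Same transition, standard step (w = 1) versus over-relaxed step (w = 1.2).
print(sor_q_update(Q_standard, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9, w=1.0))
print(sor_q_update(Q_relaxed, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9, w=1.2))
```

Which values of `w` are admissible depends on the MDP; that condition comes from the convergence analysis in the paper and is not reproduced here.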


Adaptive Variance for Changing Sparse-Reward Environments

arXiv.org Artificial Intelligence

Robots that are trained to perform a task in a fixed environment often fail when facing unexpected changes to the environment due to a lack of exploration. We propose a principled way to adapt the policy for better exploration in changing sparse-reward environments. Unlike previous works which explicitly model environmental changes, we analyze the relationship between the value function and the optimal exploration for a Gaussian-parameterized policy and show that our theory leads to an effective strategy for adjusting the variance of the policy, enabling fast adaptation to changes in a variety of sparse-reward environments.


Can Robot Attract Passersby without Causing Discomfort by User-Centered Reinforcement Learning?

arXiv.org Artificial Intelligence

The aim of our study was to develop a method by which a social robot can greet passersby and get their attention without causing them to suffer discomfort. A number of customer services have recently come to be provided by social robots rather than people, including serving as receptionists, guides, and exhibitors. Robot exhibitors, for example, can explain products being promoted by the robot owners. However, a sudden greeting by a robot can startle passersby and cause them discomfort. Social robots should thus adapt their mannerisms to the situation they face regarding passersby. We developed a method for meeting this requirement on the basis of the results of related work. Our proposed method, user-centered reinforcement learning, enables robots to greet passersby and get their attention without causing them to suffer discomfort (p<0.01). The results of an experiment in the field, an office entrance, demonstrated that our method meets this requirement.