 Reinforcement Learning


Automated Video Game Testing Using Synthetic and Human-Like Agents

arXiv.org Artificial Intelligence

In this paper, we present a new methodology that employs tester agents to automate video game testing. We introduce two types of agents, synthetic and human-like, and two distinct approaches to create them. Our agents are derived from Reinforcement Learning (RL) and Monte Carlo Tree Search (MCTS) agents, but focus on finding defects. The synthetic agent uses test goals generated from game scenarios, and these goals are further modified to examine the effects of unintended game transitions. The human-like agent uses test goals extracted by our proposed multiple greedy-policy inverse reinforcement learning (MGP-IRL) algorithm from tester trajectories. MGP-IRL captures multiple policies executed by human testers. These testers aim to find defects while interacting with the game to break it, which is considerably different from game playing. We present interaction states to model such interactions. We use our agents to produce test sequences, run the game with these sequences, and check the game for each run with an automated test oracle. We analyze the proposed method in two parts: we compare the success of human-like and synthetic agents in bug finding, and we evaluate the similarity between human-like agents and human testers. We collected 427 trajectories from human testers using the General Video Game Artificial Intelligence (GVG-AI) framework and created three games with 12 levels that contain 45 bugs. Our experiments reveal that human-like and synthetic agents compete with human testers' bug-finding performance. Moreover, we show that MGP-IRL increases the human-likeness of agents while improving the bug-finding performance.


Scalable and transferable learning of algorithms via graph embedding for multi-robot reward collection

arXiv.org Artificial Intelligence

Can the success of reinforcement learning methods for combinatorial optimization problems be extended to multi-robot scheduling problems in stochastic contexts? Three issues are particularly important in this context: quality of the resulting decisions, scalability, and transferability. To achieve these ends we generalize the concept of clique potential to stochastic clique potential. We extend a mean field inference fixed point iteration with this new concept and use it to modify the structure2vec method. We next propose a new reinforcement learning framework combining a graph representation of the problem and a consensus auction inspired by heuristics in the problem domain. This representation enables transferability in terms of the number of robots. Sequential encoding of information through multiple layers of our extended structure2vec results in 96% optimal performance of the learned heuristics. While training tractability is inherited from single robot methods in the literature, use of a multi-robot consensus auction-based relaxation of the maximum operation in the Bellman optimality equation allows for scalable selection of actions in the fitted Q-iteration. We apply our framework to multi-robot reward collection (MRRC) problems in stochastic environments with linear or non-linear rewards. In stochastic environments with non-linear rewards, the new method achieves 20% superior performance relative to the popular sequential greedy assignment (SGA) algorithm. Linear scalability in terms of training is achieved and demonstrated. Transferability is demonstrated by the use of a heuristic trained with three robots that continues to achieve 95% optimal performance when applied to problems with various numbers of robots. We further mention the results obtained when extending the approach to identical parallel machine scheduling (IPMS) problems.


r/artificial - Beat the AI in Super Mario Bros

#artificialintelligence

For an event we created a model that plays Super Mario Bros on NES. At our booth people were able to play against the model. The goal was to beat the model in level completion time. If you are interested in how to build such a model based on reinforcement learning with TensorFlow, check out our blog post.


r/deeplearning - Is it possible to make a reinforcement learning framework good at multiple undefined tasks of the same type?

#artificialintelligence

I know that a reinforcement learning framework can become superhumanly good at a given task over time, but what if I need it to be good at multiple tasks, given that all of them follow similar kinds of steps?


Decision-Making in Reinforcement Learning

arXiv.org Artificial Intelligence

This research work studies probabilistic decision-making approaches, e.g. Bayesian and Boltzmann strategies, along with various deterministic exploration strategies, e.g. greedy, epsilon-greedy, and random approaches. A comparative study of the probabilistic and deterministic decision-making approaches is carried out in the OpenAI Gym environment on the Cart Pole problem. The work discusses the Bayesian approach to decision-making in deep reinforcement learning and how dropout can reduce the computational cost. All the exploration approaches are compared. It also discusses the importance of exploration in deep reinforcement learning, and how improving exploration strategies may help in science and technology. The results show that probabilistic decision-making approaches are better in the long run than the deterministic approaches. When there is uncertainty, the Bayesian dropout approach proved to be better than all other approaches in this research work.
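As a rough illustration of two of the strategies named above, the following sketch implements epsilon-greedy and Boltzmann (softmax) action selection over a list of estimated action values. It is not taken from the paper: the function names, the `q_values` list, and the `epsilon` and `temperature` parameters are assumptions chosen for the example, and the Bayesian dropout approach is not shown.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature):
    """Softmax over action values; a lower temperature puts more mass on the best action."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature, rng=random):
    """Sample an action index in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With `epsilon` set to 0 the first function reduces to the purely greedy strategy, and with `epsilon` set to 1 it reduces to the random strategy, so one function covers three of the baselines listed in the abstract.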


Reinforcement Learning Experience Reuse with Policy Residual Representation

arXiv.org Machine Learning

Experience reuse is key to sample-efficient reinforcement learning. One of the critical issues is how the experience is represented and stored. Previously, experience has been stored in the form of features, individual models, or the average model, each lying at a different granularity. However, new tasks may require experience across multiple granularities. In this paper, we propose the policy residual representation (PRR) network, which can extract and store multiple levels of experience. The PRR network is trained on a set of tasks with a multi-level architecture, where a module in each level corresponds to a subset of the tasks. Therefore, the PRR network represents the experience in a spectrum-like way. When training on a new task, PRR can provide different levels of experience for accelerating the learning. We experiment with the PRR network on a set of grid world navigation tasks, locomotion tasks, and fighting tasks in a video game. The results show that the PRR network leads to better reuse of experience and thus outperforms some state-of-the-art approaches.


Interval timing in deep reinforcement learning agents

arXiv.org Artificial Intelligence

The measurement of time is central to intelligent behavior. We know that both animals and artificial agents can successfully use temporal dependencies to select actions. In artificial agents, little work has directly addressed (1) which architectural components are necessary for successful development of this ability, (2) how this timing ability comes to be represented in the units and actions of the agent, and (3) whether the resulting behavior of the system converges on solutions similar to those of biology. Here we studied interval timing abilities in deep reinforcement learning agents trained end-to-end on an interval reproduction paradigm inspired by experimental literature on mechanisms of timing. We characterize the strategies developed by recurrent and feedforward agents, which both succeed at temporal reproduction using distinct mechanisms, some of which bear specific and intriguing similarities to biological systems. These findings advance our understanding of how agents come to represent time, and they highlight the value of experimentally inspired approaches to characterizing agent abilities.


Policy Optimization Provably Converges to Nash Equilibria in Zero-Sum Linear Quadratic Games

arXiv.org Machine Learning

We study the global convergence of policy optimization for finding the Nash equilibria (NE) in zero-sum linear quadratic (LQ) games. To this end, we first investigate the landscape of LQ games, viewing it as a nonconvex-nonconcave saddle-point problem in the policy space. Specifically, we show that despite its nonconvexity and nonconcavity, zero-sum LQ games have the property that the stationary point of the objective with respect to the feedback control policies constitutes the NE of the game. Building upon this, we develop three projected nested-gradient methods that are guaranteed to converge to the NE of the game. Moreover, we show that all of these algorithms enjoy both global sublinear and local linear convergence rates. Simulation results are then provided to validate the proposed algorithms. To the best of our knowledge, this work appears to be the first to investigate the optimization landscape of LQ games and to provably show the convergence of policy optimization methods to the Nash equilibria. Our work serves as an initial step toward understanding the theoretical aspects of policy-based reinforcement learning algorithms for zero-sum Markov games in general.
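For orientation, a zero-sum LQ game is commonly written as below. This is the standard textbook form, not a formulation quoted from the abstract: the matrices $A, B, C, Q, R^u, R^v$ and the feedback gains $K, L$ are assumed notation, with one player choosing $K$ to minimize the cost and the other choosing $L$ to maximize it. The last line states the stationarity condition that the abstract says characterizes the NE.

```latex
\begin{aligned}
x_{t+1} &= A x_t + B u_t + C v_t, \\
u_t &= -K x_t, \qquad v_t = -L x_t, \\
\mathcal{C}(K, L) &= \mathbb{E}_{x_0}\Big[\sum_{t=0}^{\infty}
  \big(x_t^\top Q x_t + u_t^\top R^u u_t - v_t^\top R^v v_t\big)\Big], \\
\nabla_K \mathcal{C}(K^\ast, L^\ast) &= 0, \qquad
\nabla_L \mathcal{C}(K^\ast, L^\ast) = 0 .
\end{aligned}
```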


Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies

arXiv.org Machine Learning

Standard reinforcement learning methods aim to master one way of solving a task, whereas there may exist multiple near-optimal policies. Being able to identify this collection of near-optimal policies can allow a domain expert to efficiently explore the space of reasonable solutions. Unfortunately, existing approaches that quantify uncertainty over policies are not ultimately relevant to finding policies with qualitatively distinct behaviors. In this work, we formalize the difference between policies as a difference between the distribution of trajectories induced by each policy, which encourages diversity with respect to both state visitation and action choices. We derive a gradient-based optimization technique that can be combined with existing policy gradient methods to identify diverse collections of well-performing policies. We demonstrate our approach on benchmarks and a healthcare task.
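The Maximum Mean Discrepancy named in the title can be estimated from samples alone. The sketch below computes the biased (V-statistic) estimate of squared MMD with a Gaussian kernel between two sets of fixed-length trajectory feature vectors. The feature representation, the kernel choice, and the `bandwidth` parameter are assumptions made for illustration; the abstract does not specify them, and the paper's gradient-based technique is not reproduced here.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    Each sample is a list of feature vectors, here standing in for
    trajectories drawn from two different policies (an assumption).
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

A value near zero suggests the two policies induce similar trajectory distributions, while a larger value suggests distinct behavior, which is the quantity a diversity-inducing objective would try to increase.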


Attentional Policies for Cross-Context Multi-Agent Reinforcement Learning

arXiv.org Machine Learning

Many potential applications of reinforcement learning in the real world involve interacting with other agents whose numbers vary over time. We propose new neural policy architectures for these multi-agent problems. In contrast to other methods of training an individual, discrete policy for each agent and then enforcing cooperation through some additional inter-policy mechanism, we follow the spirit of recent work on the power of relational inductive biases in deep networks by learning multi-agent relationships at the policy level via an attentional architecture. In our method, all agents share the same policy, but independently apply it in their own context to aggregate the other agents' state information when selecting their next action. The structure of our architectures allows them to be applied to environments with varying numbers of agents. We demonstrate our architecture on a benchmark multi-agent autonomous vehicle coordination problem, obtaining superior results to a full-knowledge, fully-centralized reference solution, and significantly outperforming it when scaling to large numbers of agents.
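One way to see why attention copes with a varying number of agents is that it collapses any number of neighbor states into a single fixed-size vector. The sketch below uses plain scaled dot-product attention with the agent's own state as the query. It is a simplified illustration, not the paper's architecture: there are no learned projections, and the names `attend` and `policy_input` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(own_state, other_states):
    """Aggregate any number of other agents' states into one fixed-size vector."""
    dim = len(own_state)
    if not other_states:
        return [0.0] * dim  # no neighbors: fall back to a zero context
    scale = math.sqrt(dim)
    # Score each neighbor by its scaled dot product with the agent's own state.
    scores = [sum(q * k for q, k in zip(own_state, other)) / scale
              for other in other_states]
    weights = softmax(scores)
    return [sum(w * other[i] for w, other in zip(weights, other_states))
            for i in range(dim)]


def policy_input(own_state, other_states):
    """Own state concatenated with the attention context, always of length 2 * dim."""
    return list(own_state) + attend(own_state, other_states)
```

Because every agent calls the same functions on its own context, the downstream policy network sees an input of constant length whether there are zero, one, or many other agents present.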