 Reinforcement Learning


Neural Contextual Bandits with Deep Representation and Shallow Exploration

arXiv.org Machine Learning

Multi-armed bandits (MAB) (Auer et al., 2002; Audibert et al., 2009; Lattimore and Szepesvári, 2020) are a class of online decision-making problems where an agent needs to learn to maximize its expected cumulative reward while repeatedly interacting with a partially known environment. Based on a bandit algorithm (also called a strategy or policy), in each round, the agent adaptively chooses an arm, and then observes and receives a reward associated with that arm. Since only the reward of the chosen arm will be observed (bandit information feedback), a good bandit algorithm has to deal with the exploration-exploitation dilemma: a tradeoff between pulling the best arm based on existing knowledge and historical data (exploitation) and trying the arms that have not been fully explored (exploration). In many real-world applications, the agent will also be able to access detailed contexts associated with the arms. For example, when a company wants to choose an advertisement to present to a user, the recommendation will be much more accurate if the company takes into consideration the contents, specifications, and other features of the advertisements in the arm set as well as the profile of the user. To encode the contextual information, contextual bandit models and algorithms have been developed, and widely studied both in theory and in practice (Dani et al., 2008; Rusmevichientong
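
As a hedged illustration of the exploration-exploitation tradeoff described in the abstract above, the following minimal sketch runs the classic UCB1 rule on a non-contextual bandit. It is not the neural contextual algorithm of the paper. The Bernoulli reward model, the function name `ucb1`, and the example arm means are assumptions made only for this illustration.

```python
import math
import random


def ucb1(arm_means, rounds, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled.

    arm_means: hypothetical success probabilities, unknown to the agent.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # number of pulls per arm
    totals = [0.0] * n_arms  # summed reward per arm

    for t in range(1, rounds + 1):
        if t <= n_arms:
            # Initial exploration: pull every arm once.
            arm = t - 1
        else:
            # Exploitation term (empirical mean) plus an exploration bonus
            # that shrinks as an arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # Bandit feedback: only the chosen arm's reward is observed.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


if __name__ == "__main__":
    print(ucb1([0.2, 0.8], rounds=2000))
```

In this sketch the better arm ends up with most of the pulls, while the worse arm is still tried occasionally because its bonus grows whenever it is left unexplored.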


Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

arXiv.org Artificial Intelligence

A wide range of reinforcement learning (RL) problems -- including robustness, transfer learning, unsupervised RL, and emergent complexity -- require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error prone, and takes a significant amount of developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments which maximize regret, defined as the difference between the protagonist and antagonist agent's return. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
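
To make the regret objective in the abstract above concrete, here is a minimal sketch of how an adversary could score candidate environments. The abstract only says regret is the difference between the two agents' returns; the sign convention used here (antagonist minus protagonist), the use of mean returns, and the names `estimate_regret` and `candidates` are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean


def estimate_regret(protagonist_returns, antagonist_returns):
    """Regret estimate for one environment (assumed sign convention):
    mean antagonist return minus mean protagonist return."""
    return mean(antagonist_returns) - mean(protagonist_returns)


# Hypothetical episode returns for three candidate environments.
candidates = {
    "trivial":    {"protagonist": [1.0, 1.0], "antagonist": [1.0, 1.0]},
    "unsolvable": {"protagonist": [0.0, 0.0], "antagonist": [0.0, 0.0]},
    "hard_but_solvable": {"protagonist": [0.0, 0.0], "antagonist": [1.0, 1.0]},
}

# The adversary prefers the environment with the highest estimated regret.
scores = {
    name: estimate_regret(r["protagonist"], r["antagonist"])
    for name, r in candidates.items()
}
best = max(scores, key=scores.get)

if __name__ == "__main__":
    print(scores, best)
```

Under these assumptions, an unsolvable environment yields zero regret because neither agent succeeds, so a regret-maximizing adversary gains nothing from proposing it. This is the contrast with minimax training that the abstract draws.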


DeepCrawl: Deep Reinforcement Learning for Turn-based Strategy Games

arXiv.org Artificial Intelligence

In this paper we introduce DeepCrawl, a fully-playable Roguelike prototype for iOS and Android in which all agents are controlled by policy networks trained using Deep Reinforcement Learning (DRL). Our aim is to understand whether recent advances in DRL can be used to develop convincing behavioral models for non-player characters in videogames. We begin with an analysis of requirements that such an AI system should satisfy in order to be practically applicable in video game development, and identify the elements of the DRL model used in the DeepCrawl prototype. The successes and limitations of DeepCrawl are documented through a series of playability tests performed on the final game. We believe that the techniques we propose offer insight into innovative new avenues for the development of behaviors for non-player characters in video games, as they offer the potential to overcome critical issues with


[ML UTD 25] Machine Learning Up-To-Date -- Life With Data

#artificialintelligence

Multi-agent reinforcement learning (MARL) has shown recent success in increasingly complex fixed-team zero-sum environments. However, the real world is not zero-sum nor does it have fixed teams; humans face numerous social dilemmas and must learn when to cooperate and when to compete. To successfully deploy agents into the human world, it may be important that they be able to understand and help in our conflicts. Unfortunately, selfish MARL agents typically fail when faced with social dilemmas. In this work, we show evidence of emergent direct reciprocity, indirect reciprocity and reputation, and team formation when training agents with randomized uncertain social preferences (RUSP), a novel environment augmentation that expands the distribution of environments agents play in.


Google AI is now piloting Loon's internet-beaming balloons

Engadget

Alphabet's Loon has shifted to a different type of navigation system for its internet-beaming balloons. Rather than relying on algorithms designed by humans, the balloons are using an artificial intelligence system Loon developed with Google AI over the last few years. A reinforcement learning (RL) system is now in charge of navigation for a fleet of balloons over Kenya, where Loon switched on its first commercial service earlier this year. Loon says this is the first use of an RL model in "a production aerospace system." It also noted the "development is exciting because it shows that reinforcement learning can be applied to real-world use cases."


Google's AI can keep Loon balloons flying for over 300 days in a row

New Scientist

Huge stratospheric balloons that act as floating cell towers in remote areas can stay in the air for hundreds of days thanks to an artificially intelligent pilot created by Google and Loon. Loon, a subsidiary of Google's parent company Alphabet, produces tennis-court-sized balloons that are filled with helium and sent into the stratosphere. Keeping these huge balloons in a fixed position is difficult as they can get blown off course. Now, researchers at Loon and Google have joined forces to create an AI controller that can counter the harsh winds of the stratosphere by releasing air to descend or adding it to ascend, riding atmospheric currents in the desired direction. The two firms used an AI technique called deep reinforcement learning to train the balloon's controllers.


Policy Supervectors: General Characterization of Agents by their Behaviour

arXiv.org Artificial Intelligence

By studying the underlying policies of decision-making agents, we can learn about their shortcomings and potentially improve them. Traditionally, this has been done either by examining the agent's implementation, its behaviour while it is being executed, its performance with a reward/fitness function or by visualizing the density of states the agent visits. However, these methods fail to describe the policy's behaviour in complex, high-dimensional environments or do not scale to thousands of policies, which is required when studying training algorithms. We propose policy supervectors for characterizing agents by the distribution of states they visit, adopting successful techniques from the area of speech technology. Policy supervectors can characterize policies regardless of their design philosophy (e.g. rule-based vs. neural networks) and scale to thousands of policies on a single workstation machine. We demonstrate the method's applicability by studying the evolution of policies during reinforcement learning, evolutionary training and imitation learning, providing insight on, e.g., how the search space of evolutionary algorithms is also reflected in the agent's behaviour, not just in the parameters.


Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER

arXiv.org Artificial Intelligence

We prove under commonly used assumptions the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function, the actor, and a value function, the critic. Both functions can be deep neural networks of arbitrary complexity. Our framework allows showing convergence of the well known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from the two time-scale stochastic approximation theory. Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning. Previous convergence proofs assume linear function approximation, cannot treat episodic examples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic.


DERAIL: Diagnostic Environments for Reward And Imitation Learning

arXiv.org Artificial Intelligence

The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable and cannot isolate failures. As a complementary approach, we develop a suite of simple diagnostic tasks that test individual facets of algorithm performance in isolation. We evaluate a range of common reward and imitation learning algorithms on our tasks. Our results confirm that algorithm performance is highly sensitive to implementation details. Moreover, in a case-study into a popular preference-based reward learning implementation, we illustrate how the suite can pinpoint design flaws and rapidly evaluate candidate solutions. The environments are available at https://github.com/HumanCompatibleAI/seals .


Target Reaching Behaviour for Unfreezing the Robot in a Semi-Static and Crowded Environment

arXiv.org Artificial Intelligence

Robot navigation in semi-static and crowded human environments can lead to the freezing problem, where the robot cannot move because humans are standing in its path and no other path is available. Classical approaches to robot navigation do not provide a solution to this problem. In such situations, the robot could interact with the humans in order to clear its path instead of treating them as inanimate obstacles. In this work, we propose a behavior for a wheeled humanoid robot that complies with social norms for clearing its path when the robot is frozen due to the presence of humans. The behavior consists of two modules: 1) A detection module, which makes use of the Yolo v3 algorithm trained to detect human hands and human arms. 2) A gesture module, which makes use of a policy trained in simulation using the Proximal Policy Optimization algorithm. Orchestration of the two models is done using the ROS framework.