Goto

Collaborating Authors

 Reinforcement Learning


"Other-Play" for Zero-Shot Coordination

arXiv.org Artificial Intelligence

We consider the problem of zero-shot coordination - constructing AI agents that can coordinate with novel partners they have not seen before (e.g. humans). Standard Multi-Agent Reinforcement Learning (MARL) methods typically focus on the self-play (SP) setting where agents construct strategies by playing the game with themselves repeatedly. Unfortunately, applying SP naively to the zero-shot coordination problem can produce agents that establish highly specialized conventions that do not carry over to novel partners they have not been trained with. We introduce a novel learning algorithm called other-play (OP), that enhances self-play by looking for more robust strategies, exploiting the presence of known symmetries in the underlying problem. We characterize OP theoretically as well as experimentally. We study the cooperative card game Hanabi and show that OP agents achieve higher scores when paired with independently trained agents. In preliminary results we also show that our OP agents obtains higher average scores when paired with human players, compared to state-of-the-art SP agents.


Reinforcement-learning AIs are vulnerable to a new kind of attack

#artificialintelligence

The soccer bot lines up to take a shot at the goal. But instead of getting ready to block it, the goalkeeper drops to ground and wiggles its legs. Confused, the striker does a weird little sideways dance, stamping its feet and waving one arm, and then falls over. It's not a tactic you'll see used by the pros, but it shows that an artificial intelligence trained via deep reinforcement learning--the technique behind cutting-edge game-playing AIs like AlphaZero and the OpenAI Five--is more vulnerable to attack than previously thought. And that could have serious consequences.


On the Robustness of Cooperative Multi-Agent Reinforcement Learning

arXiv.org Machine Learning

In cooperative multi-agent reinforcement learning (c-MARL), agents learn to cooperatively take actions as a team to maximize a total team reward. We analyze the robustness of c-MARL to adversaries capable of attacking one of the agents on a team. Through the ability to manipulate this agent's observations, the adversary seeks to decrease the total team reward. Attacking c-MARL is challenging for three reasons: first, it is difficult to estimate team rewards or how they are impacted by an agent mispredicting; second, models are non-differentiable; and third, the feature space is low-dimensional. Thus, we introduce a novel attack. The attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take. Then, the adversary uses targeted adversarial examples to force the victim to take this action. Our results on the StartCraft II multi-agent benchmark demonstrate that c-MARL teams are highly vulnerable to perturbations applied to one of their agent's observations. By attacking a single agent, our attack method has highly negative impact on the overall team reward, reducing it from 20 to 9.4. This results in the team's winning rate to go down from 98.9% to 0%.


Deep Adversarial Reinforcement Learning for Object Disentangling

arXiv.org Artificial Intelligence

Deep learning in combination with improved training techniques and high computational power has led to recent advances in the field of reinforcement learning (RL) and to successful robotic RL applications such as in-hand manipulation. However, most robotic RL relies on a well known initial state distribution. In real-world tasks, this information is however often not available. For example, when disentangling waste objects the actual position of the robot w.r.t.\ the objects may not match the positions the RL policy was trained for. To solve this problem, we present a novel adversarial reinforcement learning (ARL) framework. The ARL framework utilizes an adversary, which is trained to steer the original agent, the protagonist, to challenging states. We train the protagonist and the adversary jointly to allow them to adapt to the changing policy of their opponent. We show that our method can generalize from training to test scenarios by training an end-to-end system for robot control to solve a challenging object disentangling task. Experiments with a KUKA LBR+ 7-DOF robot arm show that our approach outperforms the baseline method in disentangling when starting from different initial states than provided during training.


Reinforcement Learning-- An Introduction to Gradient Temporal Difference Learning Algorithms

#artificialintelligence

Reinforcement learning is one of the hottest fields to be in right now, with concrete applications growing at an incredibly rapid pace, from beating video games to robotics. At its essence, reinforcement learning (RL) deals with decision making --i.e. it attempts to answer the question of how an agent should act in a given environment. Loosely speaking, all of RL comes down to either finding or evaluating a policy, which is just a way of behaving. For example, a policy could be a playing strategy in chess. A policy takes a state -- in the chess example, the position of all the pieces on the board -- and assigns an action to it. For example, given the state of your chess board, your policy might ask you to move your queen forward.


Q Learning Intro/Table - Reinforcement Learning p.1

#artificialintelligence

Welcome to a reinforcement learning tutorial. In this part, we're going to focus on Q-Learning. Q-Learning is a model-free form of machine learning, in the sense that the AI "agent" does not need to know or have a model of the environment that it will be in. The same algorithm can be used across a variety of environments. For a given environment, everything is broken down into "states" and "actions."


Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate

arXiv.org Machine Learning

Generative adversarial imitation learning (GAIL) demonstrates tremendous success in practice, especially when combined with neural networks. Different from reinforcement learning, GAIL learns both policy and reward function from expert (human) demonstration. Despite its empirical success, it remains unclear whether GAIL with neural networks converges to the globally optimal solution. The major difficulty comes from the nonconvex-nonconcave minimax optimization structure. To bridge the gap between practice and theory, we analyze a gradient-based algorithm with alternating updates and establish its sublinear convergence to the globally optimal solution. To the best of our knowledge, our analysis establishes the global optimality and convergence rate of GAIL with neural networks for the first time.


Reinforcement Learning for Combinatorial Optimization: A Survey

arXiv.org Machine Learning

Combinatorial optimization (CO) is the workhorse of numerous important applications in operations research, engineering and other fields and, thus, has been attracting enormous attention from the research community for over a century. Many efficient solutions to common problems involve using hand-crafted heuristics to sequentially construct a solution. Therefore, it is intriguing to see how a combinatorial optimization problem can be formulated as a sequential decision making process and whether efficient heuristics can be implicitly learned by a reinforcement learning agent to find a solution. This survey explores the synergy between CO and reinforcement learning (RL) framework, which can become a promising direction for solving combinatorial problems.


Convergence of Q-value in case of Gaussian rewards

arXiv.org Machine Learning

In this paper, as a study of reinforcement learning, we converge the Q function to unbounded rewards such as Gaussian distribution. From the central limit theorem, in some real-world applications it is natural to assume that rewards follow a Gaussian distribution , but existing proofs cannot guarantee convergence of the Q-function. Furthermore, in the distribution-type reinforcement learning and Bayesian reinforcement learning that have become popular in recent years, it is better to allow the reward to have a Gaussian distribution. Therefore, in this paper, we prove the convergence of the Q-function under the condition of $E[r(s,a)^2]<\infty$, which is much more relaxed than the existing research. Finally, as a bonus, a proof of the policy gradient theorem for distributed reinforcement learning is also posted.


Adversarial Machine Learning: Perspectives from Adversarial Risk Analysis

arXiv.org Artificial Intelligence

Adversarial Machine Learning (AML) is emerging as a major field aimed at the protection of automated ML systems against security threats. The majority of work in this area has built upon a game-theoretic framework by modelling a conflict between an attacker and a defender. After reviewing game-theoretic approaches to AML, we discuss the benefits that a Bayesian Adversarial Risk Analysis perspective brings when defending ML based systems. A research agenda is included.