Agents
Weighted Double Deep Multiagent Reinforcement Learning in Stochastic Cooperative Environments
Zheng, Yan, Hao, Jianye, Zhang, Zongzhang
Recently, multiagent deep reinforcement learning (DRL) has received increasingly wide attention. Existing multiagent DRL algorithms are inefficient when facing the non-stationarity that arises because agents update their policies simultaneously in stochastic cooperative environments. This paper extends the recently proposed weighted double estimator to the multiagent domain and proposes a multiagent DRL framework, named weighted double deep Q-network (WDDQN). By utilizing the weighted double estimator and the deep neural network, WDDQN can not only reduce the bias effectively but also be extended to scenarios with raw visual inputs. To achieve efficient cooperation in the multiagent domain, we introduce the lenient reward network and the scheduled replay strategy. Experiments show that WDDQN outperforms existing DRL and multiagent DRL algorithms, i.e., double DQN and lenient Q-learning, in terms of the average reward and the convergence rate in stochastic cooperative environments.
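As a rough illustration of the idea of a weighted double estimator, the sketch below blends a single-estimator value with a double-estimator value when forming a bootstrap target. It is a minimal tabular sketch, not the WDDQN implementation: the function name `weighted_double_target`, the constant `c`, and the particular form of the weight `beta` are assumptions that follow one common formulation of weighted double Q-learning, and the abstract itself does not specify them.

```python
import numpy as np

def weighted_double_target(q_u_next, q_v_next, reward, gamma=0.99, c=1.0):
    """Bootstrap target that blends two value estimates for the next state.

    q_u_next, q_v_next: action values for the next state from two estimators
    (hypothetical names; in a deep setting these could be online and target nets).
    c: assumed non-negative constant controlling how quickly beta approaches 1.
    """
    q_u_next = np.asarray(q_u_next, dtype=float)
    q_v_next = np.asarray(q_v_next, dtype=float)

    a_star = int(np.argmax(q_u_next))  # greedy action under estimator U
    a_low = int(np.argmin(q_u_next))   # worst action under estimator U

    # Assumed weighting: the larger the gap seen by V, the more U is trusted.
    gap = abs(q_v_next[a_star] - q_v_next[a_low])
    beta = gap / (c + gap)

    # beta = 1 recovers the single estimator, beta = 0 the double estimator.
    blended = beta * q_u_next[a_star] + (1.0 - beta) * q_v_next[a_star]
    return reward + gamma * blended
```

With `beta` fixed at 0 the target reduces to the double-estimator target, which tends to underestimate, and with `beta` at 1 it reduces to the single-estimator target, which tends to overestimate; the weighting is meant to sit between the two.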
Lionel Messi, Cristiano Ronaldo, Neymar Cannot Play On Same Team: Agent
The trio are considered the best players in the world right now. Messi and Neymar played alongside each other for Barcelona from 2013 to 2017, a spell that saw the club win two La Liga titles, three Copa del Rey trophies, one Champions League crown and the FIFA Club World Cup. However, Neymar shockingly made the move from Barcelona to Paris Saint-Germain last summer after the oil-rich Ligue 1 side activated his €222 million ($273.7 million) release clause in what is still the world record transfer deal. It is widely believed that the Brazilian did not move for money but rather for the chance to play for a side built around him rather than Messi. His current injury aside, he has flourished this season with 29 goals and 19 assists in 30 games in all competitions, with PSG requiring just four points from their last six games to seal a fifth Ligue 1 title in six years. But there are rumors that the 25-year-old is unhappy in Paris, with speculation rising that he could return to La Liga with Barcelona or even rivals Real Madrid, where he would be teammates with Ronaldo. There were also reports that Zahavi, who acted as an intermediary for Neymar's move last summer, accompanied a PSG delegation to Brazil last month to check on his injury and determine his value for another potential transfer.
Successful Nash Equilibrium Agent for a 3-Player Imperfect-Information Game
Ganzfried, Sam, Nowak, Austin, Pinales, Joannier
Creating strong agents for games with more than two players is a major open problem in AI. Common approaches are based on approximating game-theoretic solution concepts such as Nash equilibrium, which have strong theoretical guarantees in two-player zero-sum games, but no guarantees in non-zero-sum games or in games with more than two players. We describe an agent that is able to defeat a variety of realistic opponents using an exact Nash equilibrium strategy in a 3-player imperfect-information game. This shows that, despite a lack of theoretical guarantees, agents based on Nash equilibrium strategies can be successful in multiplayer games after all.
Challenges and Characteristics of Intelligent Autonomy for Internet of Battle Things in Highly Adversarial Environments
Numerous, artificially intelligent, networked things will populate the battlefield of the future, operating in close collaboration with human warfighters, and fighting as teams in highly adversarial environments. This paper explores the characteristics, capabilities and intelligence required of such a network of intelligent things and humans - Internet of Battle Things (IOBT). It will experience unique challenges that are not yet well addressed by the current generation of AI and machine learning.
Free agents
For more than half a century, U.S. government officials have considered disaster scenarios, such as the consequences of a nuclear bomb going off in Washington, D.C. Only now, instead of following fixed story lines and predictions assembled ahead of time, they are using computers to play what-if with an entire artificial society: an advanced type of computer simulation called an agent-based model. Today's version of the nuclear attack model includes a digital simulation of every building in the area affected by the bomb, as well as every road, power line, hospital, and even cell tower. The model includes weather data to simulate the fallout plume. And the scenario is peopled with some 730,000 agents.
DORA The Explorer: Directed Outreaching Reinforcement Action-Selection
Choshen, Leshem, Fox, Lior, Loewenstein, Yonatan
Exploration is a fundamental aspect of Reinforcement Learning, typically implemented using stochastic action-selection. Exploration, however, can be more efficient if directed toward gaining new world knowledge. Visit-counters have been proven useful both in practice and in theory for directed exploration. However, a major limitation of counters is their locality. While there are a few model-based solutions to this shortcoming, a model-free approach is still missing. We propose $E$-values, a generalization of counters that can be used to evaluate the propagating exploratory value over state-action trajectories. We compare our approach to commonly used RL techniques, and show that using $E$-values improves learning and performance over traditional counters. We also show how our method can be implemented with function approximation to efficiently learn continuous MDPs. We demonstrate this by showing that our approach surpasses state of the art performance in the Freeway Atari 2600 game.
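To make the notion of a counter that propagates along trajectories more concrete, here is a minimal tabular sketch. The abstract does not give the update rule, so the following is an assumption: $E$-values start at 1 and are updated SARSA-style with zero reward, and a generalized count is read off as a logarithm in base $1-\alpha$. The class name `EValues` and the parameter names `alpha` and `gamma_e` are hypothetical.

```python
import math
from collections import defaultdict

class EValues:
    """Tabular exploration values; a sketch under the assumptions stated above."""

    def __init__(self, alpha=0.1, gamma_e=0.9):
        self.alpha = alpha      # learning rate, assumed to lie in (0, 1)
        self.gamma_e = gamma_e  # exploration discount, assumed to lie in [0, 1)
        self.e = defaultdict(lambda: 1.0)  # unvisited pairs start at 1

    def update(self, s, a, s_next, a_next):
        # SARSA-style update with zero reward: E decays with each visit,
        # but decays more slowly if the successor pair is still unexplored.
        target = self.gamma_e * self.e[(s_next, a_next)]
        self.e[(s, a)] = (1 - self.alpha) * self.e[(s, a)] + self.alpha * target

    def generalized_count(self, s, a):
        # With gamma_e = 0, E equals (1 - alpha)^n after n visits,
        # so this logarithm recovers the plain visit count n.
        return math.log(self.e[(s, a)]) / math.log(1 - self.alpha)
```

Setting `gamma_e` to 0 gives an ordinary local visit counter, while a positive `gamma_e` lets the exploratory value of later state-action pairs flow back to earlier ones, which is the non-local behaviour the abstract describes.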
Cognition in Dynamical Systems, Second Edition
Cognition is the process of knowing. As carried out by a dynamical system, it is the process by which the system absorbs information into its state. A complex network of agents cognizes knowledge about its environment, internal dynamics and initial state by forming emergent, macro-level patterns. Such patterns require each agent to find its place while partially aware of the whole pattern. Such partial awareness can be achieved by separating the system dynamics into two parts by timescale: the propagation dynamics and the pattern dynamics. The fast propagation dynamics describe the spread of signals across the network. If they converge to a fixed point for any quasi-static state of the slow pattern dynamics, that fixed point represents an aggregate of macro-level information. On longer timescales, agents coordinate via positive feedback to form patterns, which are defined using closed walks in the graph of agents. Patterns can be coherent, in that every part of the pattern depends on every other part for context. Coherent patterns are acausal, in that (a) they cannot be predicted and (b) no part of the stored knowledge can be mapped to any part of the pattern, or vice versa. A cognitive network's knowledge is encoded or embodied by the selection of patterns which emerge. The theory of cognition summarized here can model autocatalytic reaction-diffusion systems, artificial neural networks, market economies and ant colony optimization, among many other real and virtual systems. This theory suggests a new understanding of complexity as a lattice of contexts rather than a single measure.
Simple Reinforcement Learning with Tensorflow: Part 2 - Policy-based Agents
After a weeklong break, I am back again with part 2 of my Reinforcement Learning tutorial series. In Part 1, I had shown how to put together a basic agent that learns to choose the more rewarding of two possible options. In this post, I am going to describe how we get from that simple agent to one that is capable of taking in an observation of the world, and taking actions which provide the optimal reward not just in the present, but over the long run. With these additions, we will have a full reinforcement agent. Environments which pose the full problem to an agent are referred to as Markov Decision Processes (MDPs).
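The phrase "not just in the present, but over the long run" is usually made precise with a discounted return. The helper below is a small sketch of that computation in plain Python; the function name `discounted_returns` and the default `gamma` are assumptions for illustration and are not taken from the tutorial's own code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return already computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

In a policy-gradient agent, values like these are typically used to weight how strongly each action taken during an episode is reinforced, so actions followed by larger long-run reward become more likely.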
Tourist Navigation in Android Smartphone by using Emotion Generating Calculations and Mental State Transition Networks
Ichimura, Takumi, Tanabe, Kosuke, Tachibana, Issei
A Mental State Transition Network (MSTN), which consists of mental states connected to each other, is a basic concept for approximating human psychological and mental responses. It can represent the transition from one emotional state to another in response to a stimulus, computed with the Emotion Generating Calculations (EGC) method. A computer agent can move between mental states in the MSTN based on an analysis of emotion by the EGC method. In this paper, the Android EGC, in which the agent runs on an Android smartphone, can evaluate the feelings expressed in a conversation. The tourist navigation system built with the proposed technique is expected to serve as an emotion-oriented interface on Android smartphones.
DIPD: Gaze-Based Intention Inference in Dynamic Environments
Jiang, Yu-Sian (University of Texas at Austin) | Warnell, Garrett (US Army Research Laboratory) | Stone, Peter (University of Texas at Austin)
The ability of an autonomous system to understand something about a human's intent is important to the success of many systems that involve both humans and autonomous agents. In this work, we consider the specific setting of a human passenger riding in an autonomous vehicle, where the passenger intends to go to or learn about a specific point of interest along the vehicle's route. In this setting, we seek to provide the vehicle with the ability to infer this point of interest using real-time gaze information. This is a difficult problem in that the inference must be designed in the context of the moving vehicle, i.e., in a dynamic environment with dynamic interest points. We propose here a solution to this problem via a novel methodology called Dynamic Interest Point Detection (DIPD) for inferring the point of interest corresponding to the human's intent using gaze tracking data and a dynamic Markov Random Field (MRF) model. The energy function we develop allows the algorithm to successfully filter out noise from the eye tracker, such as eye blinks, high-speed tracking misalignment, and other sources of error. We demonstrate the success of this DIPD technique experimentally and show that it achieves up to a 28% increase in inference success compared to a nearest-neighbor approach.