The world's leading Gran Turismo 7 players have had their shot at beating Sony's AI racer, GT Sophy, and they lost spectacularly. Now, the bot has added drifting to its move set to rub salt in the wounds, as seen in a video from the Gran Turismo World Series 2023 event. The video clearly shows an AI-controlled vehicle drifting around the track like an absolute maniac. Meanwhile, most people can't pull off successful drifts in Mario Kart, let alone in realistic racers like Gran Turismo. GT Sophy (you can guess what the GT stands for) is the result of more than six years of development between Sony AI and Sony Interactive Entertainment, and uses deep reinforcement learning methods.
We present MDP Playground, a testbed for Reinforcement Learning (RL) agents with dimensions of hardness that can be controlled independently to challenge agents in different ways and obtain varying degrees of hardness in toy and complex RL environments. We consider and allow control over a wide variety of dimensions, including delayed rewards, sequence lengths, reward density, stochasticity, image representations, irrelevant features, time unit, action range and more. We define a parameterised collection of fast-to-run toy environments in OpenAI Gym by varying these dimensions and propose to use these to understand agents better. We then show how to design experiments using MDP Playground to gain insights on the toy environments. We also provide wrappers that can inject many of these dimensions into any Gym environment. We experiment with these wrappers on Atari and Mujoco to allow for understanding the effects of these dimensions on environments that are more complex than the toy environments. We also compare the effect of the dimensions on the toy and complex environments. Finally, we show how to use MDP Playground to debug agents, to study the interaction of multiple dimensions and describe further use-cases.
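To make the idea of injecting a hardness dimension into an existing environment concrete, here is a minimal sketch of a reward-delay wrapper. It does not use MDP Playground's or Gym's actual API: `CountdownEnv`, `RewardDelayWrapper`, and `run_episode` are hypothetical names, the four-value `step` return only mirrors the classic Gym convention, and paying out all pending rewards on the final step is an assumption made for this illustration.

```python
from collections import deque


class CountdownEnv:
    """Hypothetical toy environment: reward 1.0 on every step, ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class RewardDelayWrapper:
    """Hands out each reward `delay` steps late (illustrative, not a library class)."""

    def __init__(self, env, delay):
        self.env = env
        self.delay = delay
        self.pending = deque()

    def reset(self):
        self.pending.clear()
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.pending.append(reward)
        if done:
            # Assumption: flush everything still pending so no reward is lost.
            delayed = sum(self.pending)
            self.pending.clear()
        elif len(self.pending) > self.delay:
            delayed = self.pending.popleft()
        else:
            delayed = 0.0
        return obs, delayed, done, info


def run_episode(env):
    """Step with a dummy action until the episode ends; return the reward sequence."""
    env.reset()
    rewards, done = [], False
    while not done:
        _, reward, done, _ = env.step(0)
        rewards.append(reward)
    return rewards


if __name__ == "__main__":
    print(run_episode(RewardDelayWrapper(CountdownEnv(5), delay=2)))
    # prints [0.0, 0.0, 1.0, 1.0, 3.0]
```

The total return is unchanged by the wrapper; only the timing of the feedback moves, which is what makes credit assignment harder for an agent.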
Strategy optimization is a fundamental element of dynamic and complex team sports such as soccer, American football, and basketball. As the amount of data that is collected from matches in these sports has increased, so has the demand for data-driven decision-making support. If alternative strategies need to be balanced, a data-driven approach can uncover insights that are not available from qualitative analysis. This could tremendously aid teams in their match preparations. In this work, we propose a novel Markov model-based framework for soccer that allows reasoning about the specific strategies teams use in order to gain insights into the efficiency of each strategy. The framework consists of two components: (1) a learning component, which entails modeling a team’s offensive behavior by learning a Markov decision process (MDP) from event data that is collected from the team’s matches, and (2) a reasoning component, which involves a novel application of probabilistic model checking to reason about the efficacy of the learned strategies of each team. In this paper, we provide an overview of this framework and illustrate it on several use cases using real-world event data from three leagues. Our results show that the framework can be used to reason about the shot decision-making of teams and to optimise the defensive strategies used when playing against a particular team. The general ideas presented in this framework can easily be extended to other sports.
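As a rough illustration of the learning component, the transition probabilities of an MDP can be estimated from event sequences by simple counting. The sketch below is not the authors' pipeline: the state names, action names, and the `possessions` data are invented for this example, and `estimate_transitions` is a hypothetical helper that computes plain maximum-likelihood estimates.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequences):
    """Estimate P(next_state | state, action) from (state, action, next_state) triples."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for state, action, next_state in sequence:
            counts[(state, action)][next_state] += 1

    probabilities = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        probabilities[key] = {s: c / total for s, c in next_counts.items()}
    return probabilities


# Made-up possession sequences, for illustration only.
possessions = [
    [("midfield", "pass", "final_third"), ("final_third", "shot", "goal")],
    [("midfield", "pass", "loss")],
    [("midfield", "pass", "final_third"), ("final_third", "shot", "loss")],
    [("midfield", "pass", "final_third"), ("final_third", "pass", "loss")],
]

if __name__ == "__main__":
    model = estimate_transitions(possessions)
    print(model[("midfield", "pass")])
    # prints {'final_third': 0.75, 'loss': 0.25}
```

A model of this shape is the kind of object a probabilistic model checker could then query, for example for the probability of eventually reaching `goal` from `midfield`.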
MIT researchers developed a machine-learning technique that can autonomously drive a car or fly a plane through a very difficult "stabilize-avoid" scenario, in which the vehicle must stabilize its trajectory to arrive at and stay within some goal region, while avoiding obstacles. In the film "Top Gun: Maverick," Maverick, played by Tom Cruise, is charged with training young pilots to complete a seemingly impossible mission -- to fly their jets deep into a rocky canyon, staying so low to the ground they cannot be detected by radar, then rapidly climb out of the canyon at an extreme angle, avoiding the rock walls. Spoiler alert: With Maverick's help, these human pilots accomplish their mission. A machine, on the other hand, would struggle to complete the same pulse-pounding task. To an autonomous aircraft, for instance, the most straightforward path toward the target is in conflict with what the machine needs to do to avoid colliding with the canyon walls or to stay undetected.
Deep reinforcement learning has achieved impressive results in recent years; yet, it is still severely troubled by environments showcasing sparse rewards. On top of that, not all sparse-reward environments are created equal; in other words, they can differ in the presence or absence of various features, with many of them having a great impact on learning. In light of this, the present work puts together a literature compilation of such environmental features, covering particularly those that have been taken advantage of and those that continue to pose a challenge. We expect this effort to provide guidance to researchers for assessing the generality of their new proposals and to call their attention to issues that remain unresolved when dealing with sparse rewards.
One of the most exciting concepts in the world of artificial intelligence is reinforcement learning. Reinforcement learning is a type of machine learning that involves training an algorithm to make decisions in an environment, with the goal of maximizing a reward. This concept has been applied to a variety of fields, from robotics to game development, and has shown great promise in improving the performance of intelligent systems. At its core, reinforcement learning involves three main components: the environment, the agent, and the reward signal. The environment is the context in which the agent operates and makes decisions.
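The three components described above can be sketched as a short interaction loop. Everything in this example is hypothetical and chosen for brevity: `Corridor` is a made-up environment, `always_right` is a fixed (non-learning) agent, and `run` simply accumulates the reward signal.

```python
class Corridor:
    """Hypothetical environment: positions 0..length; reaching `length` gives reward 1.0."""

    def __init__(self, length=3):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is +1 (move right) or -1 (move left); position is clipped at 0.
        self.pos = max(0, self.pos + action)
        done = self.pos >= self.length
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def always_right(observation):
    """A trivial agent that ignores the observation and always moves right."""
    return 1


def run(env, policy, max_steps=100):
    """Let the agent act in the environment; return (total reward, steps taken)."""
    observation = env.reset()
    total = 0.0
    for step in range(1, max_steps + 1):
        observation, reward, done = env.step(policy(observation))
        total += reward
        if done:
            return total, step
    return total, max_steps


if __name__ == "__main__":
    print(run(Corridor(3), always_right))
    # prints (1.0, 3)
```

A learning agent would replace `always_right` with a policy that is updated from the observed rewards; the loop itself stays the same.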
Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms. One of the key difficulties is insufficient exploration, resulting in an agent being unable to learn robust policies. Intrinsically motivated agents can explore new behavior for their own sake rather than to directly solve external goals. Such intrinsic behaviors could eventually help the agent solve tasks posed by the environment. We present hierarchical-DQN (h-DQN), a framework to integrate hierarchical action-value functions, operating at different temporal scales, with goal-driven intrinsically motivated deep reinforcement learning. A top-level q-value function learns a policy over intrinsic goals, while a lower-level function learns a policy over atomic actions to satisfy the given goals.
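The two-level control flow can be illustrated with a small tabular sketch. This is not the h-DQN implementation: there is no learning and no neural network, the value tables are filled in by hand, and the binary intrinsic reward for reaching a goal is an assumption made for this example. All names (`hierarchical_rollout`, `meta_q`, `ctrl_q`, the corridor task) are hypothetical.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def hierarchical_rollout(meta_q, ctrl_q, goals, actions, transition,
                         state, terminal, max_steps=20):
    """Follow fixed two-level greedy policies; log (state, goal, action, intrinsic_reward)."""
    log, steps = [], 0
    while steps < max_steps and state != terminal:
        # Top level: pick an intrinsic goal for the current state.
        goal = goals[argmax(meta_q[state])]
        while steps < max_steps:
            # Lower level: pick an atomic action given (state, goal).
            action = actions[argmax(ctrl_q[(state, goal)])]
            state = transition(state, action)
            steps += 1
            reached = state == goal
            # Assumed intrinsic reward: 1.0 when the chosen goal is reached.
            log.append((state, goal, action, 1.0 if reached else 0.0))
            if reached:
                break
    return log


# Hand-written tables for a corridor with states 0..4 (illustration only).
goals = [2, 4]
actions = [-1, 1]
meta_q = {s: [1.0, 0.0] if s < 2 else [0.0, 1.0] for s in range(5)}
ctrl_q = {(s, g): [0.0, 1.0] if s < g else [1.0, 0.0]
          for s in range(5) for g in goals}


def move(state, action):
    """Deterministic corridor dynamics, clipped to the range 0..4."""
    return min(4, max(0, state + action))


if __name__ == "__main__":
    for entry in hierarchical_rollout(meta_q, ctrl_q, goals, actions, move, 0, 4):
        print(entry)
```

In this rollout the top level first selects goal 2 and then goal 4, while the lower level issues the individual moves toward whichever goal is active.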
We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner purely by interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous sub-sequences by learning for how long the plan can be committed to - i.e. followed without replanning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally abstracted macro-actions of varying lengths that are solely learnt from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e.g.
Guided policy search algorithms can be used to optimize complex nonlinear policies, such as deep neural networks, without directly computing policy gradients in the high-dimensional parameter space. Instead, these methods use supervised learning to train the policy to mimic a "teacher" algorithm, such as a trajectory optimizer or a trajectory-centric reinforcement learning method. Guided policy search methods provide asymptotic local convergence guarantees by construction, but it is not clear how much the policy improves within a small, finite number of iterations. We show that guided policy search algorithms can be interpreted as an approximate variant of mirror descent, where the projection onto the constraint manifold is not exact. We derive a new guided policy search algorithm that is simpler and provides appealing improvement and convergence guarantees in simplified convex and linear settings, and show that in the more general nonlinear setting, the error in the projection step can be bounded. We provide empirical results on several simulated robotic navigation and manipulation tasks that show that our method is stable and achieves similar or better performance when compared to prior guided policy search methods, with a simpler formulation and fewer hyperparameters.
Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as ε-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment, bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
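The contrast between dithering and committing to a randomized value estimate can be sketched in a few lines. This is a simplified tabular illustration, not the bootstrapped DQN implementation: `epsilon_greedy` shows per-step dithering, while `sample_head` and `greedy_for_head` show the assumed alternative of drawing one value estimate (a "head") per episode and acting greedily with respect to it throughout. The function names and the tiny `heads` table are hypothetical.

```python
import random


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def epsilon_greedy(q_values, epsilon, rng):
    """Dithering: with probability epsilon take a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return argmax(q_values)


def sample_head(num_heads, rng):
    """Draw one value-function estimate to follow for a whole episode."""
    return rng.randrange(num_heads)


def greedy_for_head(heads, head_index, state):
    """Act greedily with respect to the value estimate chosen for this episode."""
    return argmax(heads[head_index][state])


if __name__ == "__main__":
    rng = random.Random(0)

    # Two hand-written value estimates that disagree about state "s".
    heads = [{"s": [1.0, 0.0]}, {"s": [0.0, 1.0]}]

    head = sample_head(len(heads), rng)      # fixed for the episode
    action = greedy_for_head(heads, head, "s")
    print(head, action)

    print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))
    # prints 1
```

With ε-greedy, the random deviation is redrawn at every step, so exploratory behavior rarely persists; holding one sampled estimate fixed for an episode keeps the exploration consistent over many steps.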