Jiang, Minqi
Refining Minimax Regret for Unsupervised Environment Design
Beukman, Michael, Coward, Samuel, Matthews, Michael, Fellows, Mattie, Jiang, Minqi, Dennis, Michael, Foerster, Jakob
In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.
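As a brief sketch of the objective being refined (notation assumed here rather than quoted from the paper), the regret of a policy $\pi$ on a level $\theta$, and the minimax regret (MMR) objective, can be written as

$$\mathrm{Regret}_\theta(\pi) = V_\theta(\pi^*_\theta) - V_\theta(\pi), \qquad \pi_{\mathrm{MMR}} \in \arg\min_{\pi} \max_{\theta} \mathrm{Regret}_\theta(\pi),$$

where $V_\theta(\pi)$ is the expected return of $\pi$ on level $\theta$ and $\pi^*_\theta$ is an optimal policy for that level. Once the agent's maximum regret over levels reaches its minimax value, the adversary only presents regret-maximising levels, which is the stagnation that BLP is designed to remove.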
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Samvelyan, Mikayel, Raparthy, Sharath Chandra, Lupu, Andrei, Hambro, Eric, Markosyan, Aram H., Bhatt, Manish, Mao, Yuning, Jiang, Minqi, Parker-Holder, Jack, Foerster, Jakob, Rocktäschel, Tim, Raileanu, Roberta
Large language models (LLMs) have recently experienced remarkable growth in both their capabilities (OpenAI, 2023; Gemini Team et al., 2023; Touvron et al., 2023) and their applications in various fields (NLLB Team et al., 2022; Thirunavukarasu et al., 2023; Schick et al., 2023; Bubeck et al., 2023). As LLMs become increasingly complex and are deployed in safety-critical environments (Singhal et al., 2022; Li et al., 2023; Maddela et al., 2023), it is essential to thoroughly understand their robustness to different inputs. Indeed, the susceptibility of LLMs to user inputs and adversarial prompts -- prompts crafted to mislead the model or exploit its weaknesses, potentially leading to unsafe, biased, or incorrect outputs -- poses a significant challenge (Perez et al., 2022; Wei et al., 2023; Zou et al., 2023). Identifying these vulnerabilities and subsequently mitigating such risks is therefore vital to ensure the safe and reliable operation of LLMs in the real world. Current methods for identifying adversarial prompts aimed at "attacking" LLMs and eliciting undesirable outputs are limited by several factors.
Multi-Agent Diagnostics for Robustness via Illuminated Diversity
Samvelyan, Mikayel, Paglieri, Davide, Jiang, Minqi, Parker-Holder, Jack, Rocktäschel, Tim
In the rapidly advancing field of multi-agent systems, ensuring robustness in unfamiliar and adversarial settings is crucial. Notwithstanding their outstanding performance in familiar environments, these systems often falter in new situations due to overfitting during the training phase. This is especially pronounced in settings where both cooperative and competitive behaviours are present, encapsulating a dual nature of overfitting and generalisation challenges. To address this issue, we present Multi-Agent Diagnostics for Robustness via Illuminated Diversity (MADRID), a novel approach for generating diverse adversarial scenarios that expose strategic vulnerabilities in pre-trained multi-agent policies. Leveraging the concepts from open-ended learning, MADRID navigates the vast space of adversarial settings, employing a target policy's regret to gauge the vulnerabilities of these settings. We evaluate the effectiveness of MADRID on the 11vs11 version of Google Research Football, one of the most complex environments for multi-agent reinforcement learning. Specifically, we employ MADRID for generating a diverse array of adversarial settings for TiZero, the state-of-the-art approach which "masters" the game through 45 days of training on a large-scale distributed infrastructure. We expose key shortcomings in TiZero's tactical decision-making, underlining the crucial importance of rigorous evaluation in multi-agent systems.
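A rough sketch of the underlying search pattern, with every helper (mutate, descriptor, evaluate, and the policies) treated as an assumed placeholder rather than MADRID's actual components: a quality-diversity archive keeps, for each behavioural descriptor cell, the adversarial setting with the highest estimated regret for the target policy.

import random

def estimated_regret(level, target_policy, reference_policy, evaluate):
    # Regret proxy: how much better a reference policy scores than the
    # target policy on the same adversarial setting (level).
    return evaluate(reference_policy, level) - evaluate(target_policy, level)

def regret_guided_search(init_levels, mutate, descriptor, target_policy,
                         reference_policy, evaluate, iterations=1000):
    # Quality-diversity archive: one entry per descriptor cell, keeping the
    # highest-regret level found so far for that cell.
    archive = {}
    for level in init_levels:
        r = estimated_regret(level, target_policy, reference_policy, evaluate)
        archive[descriptor(level)] = (r, level)
    for _ in range(iterations):
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        r = estimated_regret(child, target_policy, reference_policy, evaluate)
        cell = descriptor(child)
        if cell not in archive or r > archive[cell][0]:
            archive[cell] = (r, child)
    return archive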
JaxMARL: Multi-Agent RL Environments in JAX
Rutherford, Alexander, Ellis, Benjamin, Gallici, Matteo, Cook, Jonathan, Lupu, Andrei, Ingvarsson, Gardar, Willi, Timon, Khan, Akbir, de Witt, Christian Schroeder, Souly, Alexandra, Bandyopadhyay, Saptarashmi, Samvelyan, Mikayel, Jiang, Minqi, Lange, Robert Tjarko, Whiteson, Shimon, Lacerda, Bruno, Hawes, Nick, Rocktäschel, Tim, Lu, Chris, Foerster, Jakob Nicolaus
Benchmarks play an important role in the development of machine learning algorithms. For example, research in reinforcement learning (RL) has been heavily influenced by available environments and benchmarks. However, RL environments are traditionally run on the CPU, limiting their scalability with typical academic compute. Recent advancements in JAX have enabled the wider use of hardware acceleration to overcome these computational hurdles, enabling massively parallel RL training pipelines and environments. This is particularly useful for multi-agent reinforcement learning (MARL) research: first, multiple agents must be considered at each environment step, adding computational burden, and second, sample complexity is increased due to non-stationarity, decentralised partial observability, or other MARL challenges. In this paper, we present JaxMARL, the first open-source code base that combines ease-of-use with GPU-enabled efficiency, and supports a large number of commonly used MARL environments as well as popular baseline algorithms. When considering wall clock time, our experiments show that our JAX-based training pipeline is up to 12500x faster per run than existing approaches. This enables efficient and thorough evaluations, with the potential to alleviate the evaluation crisis of the field. We also introduce and benchmark SMAX, a vectorised, simplified version of the popular StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. This not only enables GPU acceleration, but also provides a more flexible MARL environment, unlocking the potential for self-play, meta-learning, and other future applications in MARL. We provide code at https://github.com/flairox/jaxmarl.
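The following toy sketch (not JaxMARL's actual API) illustrates the pattern that makes such speedups possible: when environment reset and step are pure JAX functions, thousands of multi-agent environment copies can be batched with jax.vmap and compiled with jax.jit to run entirely on the accelerator.

import jax
import jax.numpy as jnp

NUM_AGENTS = 2

def env_reset(key):
    # Toy per-agent state; a real environment would use the PRNG key.
    return jnp.zeros((NUM_AGENTS, 4))

def env_step(state, actions):
    next_state = state + actions[:, None]        # toy joint dynamics
    rewards = -jnp.abs(next_state).sum(axis=-1)  # toy per-agent rewards
    return next_state, rewards

# Batch both functions over many environment copies and compile them once.
batched_reset = jax.jit(jax.vmap(env_reset))
batched_step = jax.jit(jax.vmap(env_step))

keys = jax.random.split(jax.random.PRNGKey(0), 4096)
states = batched_reset(keys)                     # (4096, NUM_AGENTS, 4)
actions = jnp.ones((4096, NUM_AGENTS))           # joint actions for every copy
states, rewards = batched_step(states, actions)  # all copies advance in one call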
Learning to Act without Actions
Schmidt, Dominik, Jiang, Minqi
Pre-training large models on vast amounts of web data has proven to be an effective approach for obtaining powerful, general models in several domains, including language and vision. However, this paradigm has not yet taken hold in deep reinforcement learning (RL). This gap is due to the fact that the most abundant form of embodied behavioral data on the web consists of videos, which do not include the action labels required by existing methods for training policies from offline data. We introduce Latent Action Policies from Observation (LAPO), a method to infer latent actions and, consequently, latent-action policies purely from action-free demonstrations. Our experiments on challenging procedurally-generated environments show that LAPO can act as an effective pre-training method to obtain RL policies that can then be rapidly fine-tuned to expert-level performance. Our approach serves as a key stepping stone to enabling the pre-training of powerful, generalist RL models on the vast amounts of action-free demonstrations readily available on the web.
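A minimal sketch of the idea, with architecture and dimensions assumed for illustration rather than taken from the paper: an inverse dynamics model (IDM) infers a compact latent action from consecutive observations, while a forward dynamics model (FDM) must reconstruct the next observation from the current observation plus that latent alone, forcing the latent to capture the effect of the unobserved action.

import torch
import torch.nn as nn

OBS_DIM, LATENT_DIM = 64, 8  # small latent acts as an information bottleneck

idm = nn.Sequential(nn.Linear(2 * OBS_DIM, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM))
fdm = nn.Sequential(nn.Linear(OBS_DIM + LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, OBS_DIM))
opt = torch.optim.Adam(list(idm.parameters()) + list(fdm.parameters()), lr=3e-4)

def train_step(obs, next_obs):
    # Infer a latent action from (o_t, o_{t+1}), then predict o_{t+1} from (o_t, z_t).
    z = idm(torch.cat([obs, next_obs], dim=-1))
    pred_next = fdm(torch.cat([obs, z], dim=-1))
    loss = ((pred_next - next_obs) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

A latent-action policy can then be trained by behavioural cloning on the inferred (observation, latent action) pairs, and later decoded to real actions or fine-tuned with a small amount of labelled or online data.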
The Generalization Gap in Offline Reinforcement Learning
Mediratta, Ishita, You, Qingfei, Jiang, Minqi, Raileanu, Roberta
Despite recent progress in offline learning, offline methods are still typically trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environments than online learning ones. We also introduce the first benchmark for evaluating generalization in offline learning, collecting datasets of varying sizes and skill levels from Procgen (2D video games) and WebShop (e-commerce websites). The datasets contain trajectories for a limited number of game levels or natural language instructions, and at test time the agent has to generalize to new levels or instructions. Our experiments reveal that existing offline learning algorithms struggle to match the performance of online RL on both train and test environments. Behavioral cloning is a strong baseline, outperforming state-of-the-art offline RL and sequence modeling approaches when trained on data from multiple environments and tested on new ones. Finally, we find that increasing the diversity of the data, rather than its size, improves performance on new environments for all offline learning algorithms. Our study demonstrates the limited generalization of current offline learning algorithms, highlighting the need for more research in this area.
Learning Curricula in Open-Ended Worlds
Jiang, Minqi
Deep reinforcement learning (RL) provides powerful methods for training optimal sequential decision-making agents. As collecting real-world interactions can entail additional costs and safety risks, the common paradigm of sim2real conducts training in a simulator, followed by real-world deployment. Unfortunately, RL agents easily overfit to the choice of simulated training environments, and worse still, learning ends when the agent masters the specific set of simulated environments. In contrast, the real world is highly open-ended, featuring endlessly evolving environments and challenges, making such RL approaches unsuitable. Simply randomizing over simulated environments is insufficient, as it requires making arbitrary distributional assumptions and can make it combinatorially unlikely to sample the specific environment instances that are useful for learning. An ideal learning process should automatically adapt the training environment to maximize the learning potential of the agent over an open-ended task space that matches or surpasses the complexity of the real world. This thesis develops a class of methods called Unsupervised Environment Design (UED), which aim to produce such open-ended processes. Given an environment design space, UED automatically generates an infinite sequence or curriculum of training environments at the frontier of the learning agent's capabilities. Through extensive empirical studies and theoretical arguments founded on minimax-regret decision theory and game theory, the findings in this thesis show that UED autocurricula can produce RL agents exhibiting significantly improved robustness and generalization to previously unseen environment instances. Such autocurricula are promising paths toward open-ended learning systems that achieve more general intelligence by continually generating and mastering additional challenges of their own design.
minimax: Efficient Baselines for Autocurricula in JAX
Jiang, Minqi, Dennis, Michael, Grefenstette, Edward, Rocktäschel, Tim
Unsupervised environment design (UED) is a form of automatic curriculum learning for training robust decision-making agents to zero-shot transfer into unseen environments. Such autocurricula have received much interest from the RL community. However, UED experiments, based on CPU rollouts and GPU model updates, have often required several weeks of training. This compute requirement is a major obstacle to rapid innovation for the field. This work introduces the minimax library for UED training on accelerated hardware. Using JAX to implement fully-tensorized environments and autocurriculum algorithms, minimax allows the entire training loop to be compiled for hardware acceleration. To provide a petri dish for rapid experimentation, minimax includes a tensorized grid-world based on MiniGrid, in addition to reusable abstractions for conducting autocurricula in procedurally-generated environments. With these components, minimax provides strong UED baselines, including new parallelized variants, which achieve over 120x speedups in wall time compared to previous implementations when training with equal batch sizes. The minimax library is available under the Apache 2.0 license at https://github.com/facebookresearch/minimax.
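As an illustration of the pattern (not minimax's actual interfaces), when the environment, the agent update, and the curriculum logic are pure JAX functions, an entire rollout can be unrolled with jax.lax.scan and compiled with jax.jit, so no per-step Python or CPU work remains.

import jax
import jax.numpy as jnp

def env_step(state, action):
    next_state = state + action      # toy dynamics
    reward = -jnp.abs(next_state)    # toy reward
    return next_state, reward

def rollout(init_state, actions):
    # lax.scan unrolls the whole episode inside a single compiled computation.
    final_state, rewards = jax.lax.scan(env_step, init_state, actions)
    return final_state, rewards

compiled_rollout = jax.jit(rollout)
final_state, rewards = compiled_rollout(jnp.zeros(()), jnp.ones(128))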
ADGym: Design Choices for Deep Anomaly Detection
Jiang, Minqi, Hou, Chaochuan, Zheng, Ao, Han, Songqiao, Huang, Hailiang, Wen, Qingsong, Hu, Xiyang, Zhao, Yue
Deep learning (DL) techniques have recently found success in anomaly detection (AD) across various fields such as finance, medical services, and cloud computing. However, most of the current research tends to view deep AD algorithms as a whole, without dissecting the contributions of individual design choices like loss functions and network architectures. This view tends to diminish the value of preliminary steps like data preprocessing, as more attention is given to newly designed loss functions, network architectures, and learning paradigms. In this paper, we aim to bridge this gap by asking two key questions: (i) Which design choices in deep AD methods are crucial for detecting anomalies? (ii) How can we automatically select the optimal design choices for a given AD dataset, instead of relying on generic, pre-existing solutions? To address these questions, we introduce ADGym, a platform specifically crafted for comprehensive evaluation and automatic selection of AD design elements in deep methods. Our extensive experiments reveal that relying solely on existing leading methods is not sufficient. In contrast, models developed using ADGym significantly surpass current state-of-the-art techniques.
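A hedged sketch of the kind of design-space search the platform enables, with the design space and the build_and_evaluate helper invented for illustration (ADGym's actual components, and its automatic selection of design choices, go beyond this exhaustive loop):

from itertools import product

DESIGN_SPACE = {
    "preprocessing": ["none", "standardise", "minmax"],
    "architecture": ["mlp", "autoencoder"],
    "loss": ["bce", "deviation", "hinge"],
}

def select_best_design(train_data, val_data, build_and_evaluate):
    # build_and_evaluate(design, train_data, val_data) -> validation score
    # (e.g. AUC-ROC); it stands in for training a deep AD model with the
    # given combination of design choices.
    best_design, best_score = None, float("-inf")
    for combo in product(*DESIGN_SPACE.values()):
        design = dict(zip(DESIGN_SPACE.keys(), combo))
        score = build_and_evaluate(design, train_data, val_data)
        if score > best_score:
            best_design, best_score = design, score
    return best_design, best_score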
Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design
Jackson, Matthew Thomas, Jiang, Minqi, Parker-Holder, Jack, Vuorio, Risto, Lu, Chris, Farquhar, Gregory, Whiteson, Shimon, Foerster, Jakob Nicolaus
The past decade has seen vast progress in deep reinforcement learning (RL) on the back of algorithms manually designed by human researchers. Recently, it has been shown that it is possible to meta-learn update rules, with the hope of discovering algorithms that can perform well on a wide range of RL tasks. Despite impressive initial results from algorithms such as Learned Policy Gradient (LPG), there remains a generalization gap when these algorithms are applied to unseen environments. In this work, we examine how characteristics of the meta-training distribution impact the generalization performance of these algorithms. Motivated by this analysis and building on ideas from Unsupervised Environment Design (UED), we propose a novel approach for automatically generating curricula to maximize the regret of a meta-learned optimizer, in addition to a novel approximation of regret, which we name algorithmic regret (AR). The result is our method, General RL Optimizers Obtained Via Environment Design (GROOVE). In a series of experiments, we show that GROOVE achieves superior generalization to LPG, and evaluate AR against baseline metrics from UED, identifying it as a critical component of environment design in this setting. We believe this approach is a step towards the discovery of truly general RL algorithms, capable of solving a wide range of real-world environments.
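A rough sketch of the idea behind algorithmic regret, with all helpers assumed rather than taken from the paper: on a candidate level, compare an agent trained by the meta-learned optimizer against an agent trained by a strong hand-designed baseline, and let the environment-design loop favour levels where the gap is largest.

def algorithmic_regret(level, train_with_meta_optimizer, train_with_baseline, evaluate):
    # Both trainers return a policy trained on this level; evaluate returns its return.
    meta_policy = train_with_meta_optimizer(level)
    baseline_policy = train_with_baseline(level)
    return evaluate(baseline_policy, level) - evaluate(meta_policy, level)

def groove_style_level_choice(candidate_levels, ar_fn):
    # The curriculum prioritises the level with the highest algorithmic regret.
    return max(candidate_levels, key=ar_fn)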