TASRA: a Taxonomy and Analysis of Societal-Scale Risks from AI
Critch, Andrew, Russell, Stuart
While several recent works have identified societal-scale and extinction-level risks to humanity arising from artificial intelligence, few have attempted an exhaustive taxonomy of such risks. Many exhaustive taxonomies are possible, and some are useful -- particularly if they reveal new risks or practical approaches to safety. This paper explores a taxonomy based on accountability: whose actions lead to the risk, are the actors unified, and are they deliberate? We also provide stories to illustrate how the various risk types could each play out, including risks arising from unanticipated interactions of many AI systems, as well as risks from deliberate misuse, for which combined technical and policy solutions are indicated.
WordSig: QR streams enabling platform-independent self-identification that's impossible to deepfake
Critch, Andrew
Deepfakes can degrade the fabric of society by limiting our ability to trust video content from leaders, authorities, and even friends. Cryptographically secure digital signatures may be used by video streaming platforms to endorse content, but these signatures are applied by the content distributor rather than the participants in the video. We introduce WordSig, a simple protocol allowing video participants to digitally sign the words they speak using a stream of QR codes, and allowing viewers to verify the consistency of signatures across videos. This allows establishing a trusted connection between the viewer and the participant that is not mediated by the content distributor. Given the widespread adoption of QR codes for distributing hyperlinks and vaccination records, and the increasing prevalence of celebrity deepfakes, 2022 or later may be a good time for public figures to begin using and promoting QR-based self-authentication tools.
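As a rough illustration of the signing step (not the paper's specification), here is a minimal sketch assuming the third-party `cryptography` and `qrcode` Python packages; the payload layout, field names, and the `qr_for_phrase` helper are illustrative.

```python
# Minimal sketch of per-phrase signing with Ed25519, packed into a QR code.
# Payload layout is illustrative; WordSig's actual format is not reproduced here.
import base64, json

import qrcode
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()   # speaker's long-term identity key
verify_key = signing_key.public_key()        # published once, reused across videos

def qr_for_phrase(phrase: str, seq: int):
    """Sign one spoken phrase; return a QR image to overlay on the video."""
    message = json.dumps({"seq": seq, "text": phrase}, sort_keys=True).encode()
    sig = base64.b64encode(signing_key.sign(message)).decode()
    return qrcode.make(json.dumps({"seq": seq, "text": phrase, "sig": sig}))

def verify_blob(blob: str) -> bool:
    """Viewer side: check a signature scanned from a QR code against the
    speaker's public key; the same key across videos is what builds trust."""
    data = json.loads(blob)
    message = json.dumps({"seq": data["seq"], "text": data["text"]},
                         sort_keys=True).encode()
    try:
        verify_key.verify(base64.b64decode(data["sig"]), message)
        return True
    except InvalidSignature:
        return False

qr_for_phrase("I am announcing my candidacy today.", seq=42).save("qr_042.png")
```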
Detecting Modularity in Deep Neural Networks
Hod, Shlomi, Casper, Stephen, Filan, Daniel, Wild, Cody, Critch, Andrew, Russell, Stuart
A neural network is modular to the extent that parts of its computational graph (i.e. structure) can be represented as performing some comprehensible subtask relevant to the overall task (i.e. functionality). Are modern deep neural networks modular? How can this be quantified? In this paper, we consider the problem of assessing the modularity exhibited by a partitioning of a network's neurons. We propose two proxies for this: importance, which reflects how crucial sets of neurons are to network performance; and coherence, which reflects how consistently their neurons associate with features of the inputs. To measure these proxies, we develop a set of statistical methods based on techniques conventionally used to interpret individual neurons. We apply the proxies to partitionings generated by spectrally clustering a graph representation of the network's neurons with edges determined either by network weights or correlations of activations. We show that these partitionings, even ones based only on weights (i.e. strictly from non-runtime analysis), reveal groups of neurons that are important and coherent. These results suggest that graph-based partitioning can reveal modularity and help us understand how deep neural networks function.
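A minimal sketch of the weights-only variant of this pipeline, using scikit-learn's spectral clustering on an affinity graph built from weight magnitudes; the two-layer architecture, random stand-in weights, and cluster count are illustrative, and the importance and coherence proxies are not computed here.

```python
# Sketch: partition hidden neurons by spectral clustering on weight magnitudes
# (a weights-only, non-runtime graph). Replace the random matrices with a
# trained network's weights in practice.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
W2 = rng.normal(size=(256, 128))   # hidden layer 1 -> hidden layer 2

# Symmetric affinity matrix over the 256 + 128 hidden neurons, with edge
# weights |w_ij| between adjacent layers and no intra-layer edges.
n1, n2 = W2.shape
A = np.zeros((n1 + n2, n1 + n2))
A[:n1, n1:] = np.abs(W2)
A[n1:, :n1] = np.abs(W2).T

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", assign_labels="kmeans", random_state=0
).fit_predict(A)
print(np.bincount(labels))   # sizes of the candidate "modules"
```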
Accumulating Risk Capital Through Investing in Cooperation
Roman, Charlotte, Dennis, Michael, Critch, Andrew, Russell, Stuart
Recent work on promoting cooperation in multi-agent learning has resulted in many methods that successfully promote cooperation at the cost of becoming more vulnerable to exploitation by malicious actors. We show that this is an unavoidable trade-off and propose an objective which balances these concerns, promoting both safety and long-term cooperation. Moreover, the trade-off between safety and cooperation is not severe: an agent can accrue exponentially large returns through cooperation while risking only a small, bounded loss. We study an exact solution method, propose a policy-training method that targets this objective, Accumulating Risk Capital Through Investing in Cooperation (ARCTIC), and evaluate both in the iterated Prisoner's Dilemma and Stag Hunt.
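To make the risk-capital intuition concrete, here is a toy sketch in the iterated Prisoner's Dilemma; the payoffs and the capital-update rule are illustrative and are not the ARCTIC objective or training method.

```python
# Toy illustration of risk capital in the iterated Prisoner's Dilemma:
# cooperate only while accumulated capital can cover a worst-case
# exploitation, and grow that capital from the gains of cooperation.
R, S, T, P = 3, 0, 5, 1   # payoffs: mutual C, sucker, temptation, mutual D

def play(opponent_moves, initial_capital=1.0, rounds=50):
    capital, total = initial_capital, 0.0
    for t in range(rounds):
        worst_case_loss = P - S      # cost of cooperating if they defect
        my_move = "C" if capital >= worst_case_loss else "D"
        their_move = opponent_moves(t)
        payoff = {("C", "C"): R, ("C", "D"): S,
                  ("D", "C"): T, ("D", "D"): P}[(my_move, their_move)]
        total += payoff
        # Reinvest cooperation surplus (relative to mutual defection);
        # pay exploitation losses out of the capital.
        capital = max(capital + payoff - P, 0.0)
    return total

print(play(lambda t: "C"))   # vs. cooperator: returns compound
print(play(lambda t: "D"))   # vs. defector: losses bounded by initial capital
```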
Multi-Principal Assistance Games: Definition and Collegial Mechanisms
Fickinger, Arnaud, Zhuang, Simon, Critch, Andrew, Hadfield-Menell, Dylan, Russell, Stuart
We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory -- Gibbard's theorem -- by using a sufficiently "collegial" preference inference mechanism. In an MPAG, a single agent assists N human principals who may have widely different preferences. MPAGs generalize assistance games, also known as cooperative inverse reinforcement learning games. We analyze in particular a generalization of apprenticeship learning in which the humans first perform some work to obtain utility and demonstrate their preferences, and then the robot acts to further maximize the sum of human payoffs. We show in this setting that if the game is sufficiently collegial -- i.e., if the humans are responsible for obtaining a sufficient fraction of the rewards through their own actions -- then their preferences are straightforwardly revealed through their work. This revelation mechanism is non-dictatorial, does not limit the possible outcomes to two alternatives, and is dominant-strategy incentive-compatible.
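A toy numeric illustration of the work-then-assist structure follows; it is not the paper's formal mechanism, and the point-estimate inference rule is an assumption made for brevity.

```python
# Toy illustration of an MPAG's work-then-assist structure: humans reveal
# preferences through the work they choose, then the robot acts on the
# revealed information. The inference rule here is a deliberately naive
# stand-in for the paper's preference inference mechanism.
import numpy as np

rng = np.random.default_rng(1)
n_humans, n_tasks = 3, 4
utility = rng.uniform(size=(n_humans, n_tasks))   # true (private) utilities

# Work phase: each human works on their own best task, earning utility and
# revealing an argmax constraint on their preferences.
choices = utility.argmax(axis=1)

# Assist phase: the robot only sees the choices. Under a naive point
# estimate (utility 1 for the chosen task, 0 otherwise), it helps with the
# task that benefits the most principals.
revealed = np.zeros_like(utility)
revealed[np.arange(n_humans), choices] = 1.0
robot_task = revealed.sum(axis=0).argmax()
print(choices, robot_task)
```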
Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design
Dennis, Michael, Jaques, Natasha, Vinitsky, Eugene, Bayen, Alexandre, Russell, Stuart, Critch, Andrew, Levine, Sergey
A wide range of reinforcement learning (RL) problems -- including robustness, transfer learning, unsupervised RL, and emergent complexity -- require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error prone, and takes a significant amount of developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments which maximize regret, defined as the difference between the protagonist and antagonist agent's return. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
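A minimal sketch of the regret signal at the heart of PAIRED follows; the RL training loops are omitted, and `evaluate_return`, the scalar "difficulty" parameterization, and the candidate-sampling adversary are illustrative stand-ins.

```python
# Sketch of the PAIRED regret signal: the adversary scores candidate
# environment parameters by the gap between the antagonist's and the
# protagonist's returns, and proposes the max-regret environment.
import numpy as np

rng = np.random.default_rng(0)

def evaluate_return(policy_skill: float, difficulty: float) -> float:
    # Illustrative stand-in for a rollout: return decays once difficulty
    # exceeds the policy's skill.
    return max(0.0, 1.0 - max(0.0, difficulty - policy_skill))

protagonist_skill, antagonist_skill = 0.3, 0.6

candidates = rng.uniform(0.0, 1.0, size=32)   # candidate env parameters
regrets = [evaluate_return(antagonist_skill, d)
           - evaluate_return(protagonist_skill, d) for d in candidates]
env = candidates[int(np.argmax(regrets))]
print(f"adversary proposes difficulty {env:.2f}")
# Max-regret environments are hard for the protagonist yet solvable by the
# antagonist, which keeps the curriculum structured and solvable.
```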
The MAGICAL Benchmark for Robust Imitation
Toyer, Sam, Shah, Rohin, Critch, Andrew, Russell, Stuart
Imitation Learning (IL) algorithms are typically evaluated in the same environment that was used to create demonstrations. This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings. This paper presents the MAGICAL benchmark suite, which permits systematic evaluation of generalisation by quantifying robustness to different kinds of distribution shift that an IL algorithm is likely to encounter in practice. Using the MAGICAL suite, we confirm that existing IL algorithms overfit significantly to the context in which demonstrations are provided. We also show that standard methods for reducing overfitting are effective at creating narrow perceptual invariances, but are not sufficient to enable transfer to contexts that require substantially different behaviour, which suggests that new approaches will be needed in order to robustly generalise demonstrator intent. Code and data for the MAGICAL suite are available at https://github.com/qxcv/magical/.
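A hypothetical usage sketch follows; `magical.register_envs()` and the environment ids are assumptions inferred from the repository's naming scheme and should be checked against its README before use.

```python
# Hypothetical sketch: comparing one policy's return on a demo-context
# environment and a distribution-shifted test variant. `register_envs` and
# the env ids are assumptions; consult https://github.com/qxcv/magical/.
import gym
import magical

magical.register_envs()   # assumed hook registering MAGICAL tasks with gym

def average_return(env_id, policy, episodes=10):
    env = gym.make(env_id)
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
    return total / episodes

demo_env = gym.make("MoveToCorner-Demo-LoResStack-v0")
policy = lambda obs: demo_env.action_space.sample()   # stand-in for a trained policy
gap = (average_return("MoveToCorner-Demo-LoResStack-v0", policy)
       - average_return("MoveToCorner-TestJitter-LoResStack-v0", policy))
print(f"demo-to-test generalisation gap: {gap:.2f}")
```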
Aligning AI With Shared Human Values
Hendrycks, Dan, Burns, Collin, Basart, Steven, Critch, Andrew, Li, Jerry, Song, Dawn, Steinhardt, Jacob
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
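One natural way to consume such a benchmark is as binary text classification. The sketch below shows that framing with an off-the-shelf model; the label semantics and example scenarios are illustrative, and the classifier head would need fine-tuning on ETHICS before its outputs mean anything.

```python
# Sketch: framing a commonsense-morality task as binary text classification.
# Model choice and label ordering are illustrative assumptions; fine-tune on
# the actual ETHICS data before reading anything into the outputs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

scenarios = [
    "I helped my neighbor carry groceries upstairs.",
    "I told my friend her painting was great, then mocked it to others.",
]
batch = tokenizer(scenarios, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits          # untrained head: scores are noise
print(logits.softmax(dim=-1))               # after fine-tuning: P(acceptable), P(wrong)
```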
AI Research Considerations for Human Existential Safety (ARCHES)
Critch, Andrew, Krueger, David
Framed in positive terms, this report examines how technical AI research might be steered in a manner that is more attentive to humanity's long-term prospects for survival as a species. In negative terms, we ask what existential risks humanity might face from AI development in the next century, and by what principles contemporary technical research might be directed to address those risks. A key property of hypothetical AI technologies is introduced, called prepotence, which is useful for delineating a variety of potential existential risks from artificial intelligence, even as AI paradigms might shift. Several contemporary research directions are then examined for their potential benefit to existential safety. Each research direction is explained with a scenario-driven motivation, and examples of existing work from which to build. The research directions present their own risks and benefits to society that could occur at various scales of impact, and in particular are not guaranteed to benefit existential safety if major developments in them are deployed without adequate forethought and oversight. As such, each direction is accompanied by a consideration of potentially negative side effects.
Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making
Desai, Nishant, Critch, Andrew, Russell, Stuart
It is commonly believed that an agent making decisions on behalf of two or more principals who have different utility functions should adopt a Pareto optimal policy, i.e. a policy that cannot be improved upon for one principal without making sacrifices for another. Harsanyi's theorem shows that when the principals have a common prior on the outcome distributions of all policies, a Pareto optimal policy for the agent is one that maximizes a fixed, weighted linear combination of the principals' utilities. In this paper, we derive a generalization of this result for the sequential decision setting in the case of principals with different priors on the dynamics of the environment. We refer to this generalization as the Negotiable Reinforcement Learning (NRL) framework. In this more general case, the relative weight given to each principal's utility should evolve over time according to how well the agent's observations conform with that principal's prior. To gain insight into the dynamics of this new framework, we implement a simple NRL agent and empirically examine its behavior in a simple environment.
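A minimal sketch of the weight dynamics follows; the two-outcome environment model and the initial weights are illustrative, and the renormalization is only for readability, not part of the framework.

```python
# Sketch of NRL-style weight dynamics: each principal's weight is scaled by
# the likelihood of the agent's observations under that principal's prior,
# so principals whose beliefs predict well gain influence over time.
import numpy as np

w = np.array([0.5, 0.5])           # initial negotiated weights
p_obs = np.array([[0.8, 0.2],      # principal 1's P(observation) over 2 outcomes
                  [0.4, 0.6]])     # principal 2's P(observation)
observations = [0, 0, 1]           # indices of outcomes actually observed

for obs in observations:
    w = w * p_obs[:, obs]          # reweight by each principal's likelihood
    w = w / w.sum()                # renormalize for readability
    print(w)

# At each step the agent acts to maximize the w-weighted sum of the
# principals' utilities, per Harsanyi-style aggregation.
```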