Williams-King, David
Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?
Bengio, Yoshua, Cohen, Michael, Fornasiere, Damiano, Ghosn, Joumana, Greiner, Pietro, MacDermott, Matt, Mindermann, Sören, Oberman, Adam, Richardson, Jesse, Richardson, Oliver, Rondeau, Marc-Antoine, St-Charles, Pierre-Luc, Williams-King, David
The leading AI companies are increasingly focused on building generalist AI agents -- systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.
Representation Engineering for Large-Language Models: Survey and Research Challenges
Bartoszcze, Lukasz, Munshi, Sarthak, Sukidi, Bryan, Yen, Jennifer, Yang, Zejia, Williams-King, David, Le, Linh, Asuzu, Kosi, Maple, Carsten
Large-language models are capable of completing a variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem through a new approach utilizing samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt-engineering and fine-tuning. We outline risks such as performance decrease, compute time increases and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe and personalizable LLMs.
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
Williams-King, David, Le, Linh, Oberman, Adam, Bengio, Yoshua
As LLMs develop increasingly advanced capabilities, there is an increased need to minimize the harm that could be caused to society by certain model outputs; hence, most LLMs have safety guardrails added, for example via fine-tuning. In this paper, we argue the position that current safety fine-tuning is very similar to a traditional cat-and-mouse game (or arms race) between attackers and defenders in cybersecurity. Model jailbreaks and attacks are patched with bandaids to target the specific attack mechanism, but many similar attack vectors might remain. When defenders are not proactively coming up with principled mechanisms, it becomes very easy for attackers to sidestep any new defenses. We show how current defenses are insufficient to prevent new adversarial jailbreak attacks, reward hacking, and loss of control problems. In order to learn from past mistakes in cybersecurity, we draw analogies with historical examples and develop lessons learned that can be applied to LLM safety. These arguments support the need for new and more principled approaches to designing safe models, which are architected for security from the beginning. We describe several such approaches from the AI literature.
Ensemble Model Patching: A Parameter-Efficient Variational Bayesian Neural Network
Chang, Oscar, Yao, Yuling, Williams-King, David, Lipson, Hod
Two main obstacles preventing the widespread adoption of variational Bayesian neural networks are the high parameter overhead that makes them infeasible on large networks, and the difficulty of implementation, which can be thought of as "programming overhead." MC dropout [Gal and Ghahramani, 2016] is popular because it sidesteps these obstacles. Nevertheless, dropout is often harmful to model performance when used in networks with batch normalization layers [Li et al., 2018], which are an indispensable part of modern neural networks. We construct a general variational family for ensemble-based Bayesian neural networks that encompasses dropout as a special case. We further present two specific members of this family that work well with batch normalization layers, while retaining the benefits of low parameter and programming overhead, comparable to non-Bayesian training. Our proposed methods improve predictive accuracy and achieve almost perfect calibration on a ResNet-18 trained with ImageNet.
The Gold Standard: Automatically Generating Puzzle Game Levels
Williams-King, David (University of Calgary) | Denzinger, Jörg (University of Calgary) | Aycock, John (University of Calgary) | Stephenson, Ben (University of Calgary)
KGoldrunner is a puzzle-oriented platform game with dynamic elements. This paper describes Goldspinner, an automatic level generation system for KGoldrunner. Goldspinner has two parts: a genetic algorithm that generates candidate levels, and simulations that use an AI agent to attempt to solve the level from the player's perspective. Our genetic algorithm determines how "good" a candidate level is by examining many different properties of the level, all based on its static aspects. Once the genetic algorithm identifies a good candidate, simulations are performed to evaluate the dynamic aspects of the level. Levels that are statically good may not be dynamically good (or even solvable), making simulation an essential aspect of our level generation system. By carefully optimizing our genetic algorithm and simulation agent we have created an efficient system capable of generating interesting levels in real time.