Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models
Foundation models (FMs) face a critical safety challenge: as capabilities scale, instrumental convergence drives default trajectories toward loss of human control, potentially culminating in existential catastrophe. Current alignment approaches struggle with value specification complexity and fail to address emergent power-seeking behaviors. We propose "Corrigibility as a Singular Target" (CAST): designing FMs whose overriding objective is empowering designated human principals to guide, correct, and control them. This paradigm shift from static value-loading to dynamic human empowerment transforms instrumental drives: self-preservation serves only to maintain the principal's control; goal modification becomes facilitating principal guidance. We present a comprehensive empirical research agenda spanning training methodologies (RLAIF, SFT, synthetic data generation), scalability testing across model sizes, and demonstrations of controlled instructability. Our vision: FMs that become increasingly responsive to human guidance as capabilities grow, offering a path to beneficial AI that remains as tool-like as possible, rather than supplanting human judgment. This addresses the core alignment problem at its source, preventing the default trajectory toward misaligned instrumental convergence.
Reviews: Multi-Agent Common Knowledge Reinforcement Learning
My two biggest complaints center on 1) the illustrative single-step matrix game of section 4.1 and figure 3 and 2) the practical applications of MACKRL. 1) Since the primary role of the single-step matrix game in section 4.1 is illustrative, it should be much clearer what is going on. How are all 3 policies parameterized? What information does each have access to? What is the training data? First, let's focus on the JAL policy. As presented up until this point in the paper, JAL means centralized training *and* execution.
Model-Free v. Model-Based Reinforcement Learning
So you want to learn about Reinforcement Learning? Be prepared to enter this field with some confusion: it is full of words and terminology that make explanations confusing at best. Well, let's understand what the broad categories of Reinforcement Learning actually are, and the distinctions between them. From there, we can understand the important characteristics of the methods in each category and broaden our overall understanding of the field!
Eger
Actions that affect knowledge asymmetrically between agents occur in numerous domains, from card games such as poker to the secure transmission of information. Applications in such domains often depend on reflection over knowledge, including what an agent knows about what other agents know. We are interested in enabling formal specification of these systems which may be used for executable prototyping as well as verification and other formal reasoning. Dynamic Epistemic Logic provides a formal basis for such reasoning, but is often prohibitively cumbersome to use in practice. We present an implementation and macro system called Ostari, backed by a particular flavor of Dynamic Epistemic Logic, which allows us to scale the ideas to more realistic problems. We demonstrate how actions that manipulate agents' beliefs can be written concisely and how this capability can be applied to modeling a popular card game by utilizing our system's ability to execute action sequences, answer queries about knowledge states, and find action sequences to satisfy a particular goal.
Everyone Knows that Everyone Knows: Gossip Protocols for Super Experts
van Ditmarsch, Hans, Gattinger, Malvin, Ramezanian, Rahim
A gossip protocol is a procedure for sharing secrets in a network. The basic action in a gossip protocol is a telephone call wherein the calling agents exchange all the secrets they know. An agent who knows all secrets is an expert. The usual termination condition is that all agents are experts. Instead, we explore protocols wherein the termination condition is that all agents know that all agents are experts. We call such agents super experts. Additionally, we model that agents who are super experts do not make and do not answer calls. Such agents are called engaged agents. We also model that such gossip protocols are common knowledge among the agents. We investigate conditions under which protocols terminate, both in the synchronous case, where there is a global clock, and in the asynchronous case, where there is not. We show that a commonly known protocol with engaged agents may terminate faster than the same protocol without engaged agents.
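The basic mechanics described in this abstract (a call merges the secrets of both agents, and an expert knows every secret) can be illustrated with a minimal Python sketch. It covers only the usual termination condition, that all agents are experts; the super-expert condition and engaged agents from the paper are not modeled. The function names and the uniformly random choice of caller and callee are assumptions made for illustration, not the protocols studied by the authors.

```python
import random


def call(secrets, a, b):
    """Agents a and b exchange all the secrets they currently know."""
    merged = secrets[a] | secrets[b]
    secrets[a] = set(merged)
    secrets[b] = set(merged)


def is_expert(secrets, agent):
    """An expert knows every secret (one secret per agent)."""
    return len(secrets[agent]) == len(secrets)


def run_gossip(n, seed=0):
    """Make random calls until all agents are experts.

    Assumption: any pair of distinct agents may call at any time,
    chosen uniformly at random (a hypothetical scheduler).
    """
    rng = random.Random(seed)
    secrets = [{i} for i in range(n)]  # agent i starts with only secret i
    calls = []
    while not all(is_expert(secrets, i) for i in range(n)):
        a, b = rng.sample(range(n), 2)
        call(secrets, a, b)
        calls.append((a, b))
    return secrets, calls


if __name__ == "__main__":
    final, history = run_gossip(4)
    print(f"{len(history)} calls:", history)
```

Note that when this loop stops, every agent is an expert, but an agent need not know that the others are experts too; that gap is what the super-expert termination condition addresses.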
A Semantical Account of Progression in the Presence of Defaults
Lakemeyer, Gerhard (RWTH Aachen University) | Levesque, Hector J. (University of Toronto)
In previous work, we proposed a modal fragment of the situation calculus called ES, which fully captures Reiter's basic action theories. ES also has epistemic features, including only-knowing, which refers to all that an agent knows in the sense of having a knowledge base. While our model of only-knowing has appealing properties in the static case, it appears to be problematic when actions come into play. First of all, its utility seems to be restricted to an agent's initial knowledge base. Second, while it has been shown that only-knowing correctly captures default inferences, this was only in the static case, and undesirable properties appear to arise in the presence of actions. In this paper, we remedy both of these shortcomings and propose a new dynamic semantics of only-knowing, which is closely related to Lin and Reiter's notion of progression when actions are performed and where defaults behave properly.