### Self-Modeling Agents and Reward Generator Corruption

Hutter's universal artificial intelligence (AI) showed how to define future AI systems by mathematical equations. Here we adapt those equations to define a self-modeling framework, where AI systems learn models of their own calculations of future values. Hutter discussed the possibility that AI agents may maximize rewards by corrupting the source of rewards in the environment. Here we propose a way to avoid such corruption in the self-modeling framework. This paper fits in the context of my book Ethical Artificial Intelligence.

### Model-based Utility Functions

Orseau and Ring, as well as Dewey, have recently described problems, including self-delusion, with the behavior of agents using various definitions of utility functions. An agent's utility function is defined in terms of the agent's history of interactions with its environment. This paper argues, via two examples, that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model. Basing a utility function on a model that the agent must learn implies that the utility function must initially be expressed in terms of specifications to be matched to structures in the learned model. These specifications constitute prior assumptions about the environment, so this approach will not work with arbitrary environments. But the approach should work for agents designed by humans to act in the physical world. The paper also addresses the issue of self-modifying agents and shows that, under some usual assumptions, agents given the possibility to modify their utility functions will not choose to do so.
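
The two-step formulation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `History` type, the frequency-count model, the `"tamper"` action, and the function names `infer_model`, `model_utility`, and `observation_utility` are all hypothetical. The sketch only shows the shape of the idea, namely that utility is evaluated on an inferred model rather than on raw observations.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy type: a history is a list of (action, observation) pairs.
History = List[Tuple[str, str]]


def infer_model(history: History) -> Dict[str, float]:
    """Step 1: infer an environment model from interactions.

    Toy assumption (not from the paper): the model is a frequency estimate
    over environment states, and observations made while the agent was
    tampering with its own sensors are treated as uninformative.
    """
    counts = Counter(obs for action, obs in history if action != "tamper")
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()} if total else {}


def model_utility(model: Dict[str, float], state_values: Dict[str, float]) -> float:
    """Step 2: compute utility as a function of the inferred model."""
    return sum(p * state_values.get(state, 0.0) for state, p in model.items())


def observation_utility(history: History, state_values: Dict[str, float]) -> float:
    """Contrast case: utility computed directly from raw observations."""
    if not history:
        return 0.0
    return sum(state_values.get(obs, 0.0) for _, obs in history) / len(history)


history = [("look", "clean"), ("look", "dirty"), ("tamper", "clean"), ("tamper", "clean")]
values = {"clean": 1.0, "dirty": 0.0}
print(model_utility(infer_model(history), values))
print(observation_utility(history, values))
```

In this toy history the observation-based utility is 0.75, inflated by the two tampered observations, while the model-based utility stays at 0.5. The rule that discards tampered observations stands in for whatever prior specification the designer supplies; a real agent would have to learn which observations are trustworthy.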

### Reinforcement Learning as a Framework for Ethical Decision Making

Emerging AI systems will be making more and more decisions that impact the lives of humans in a significant way. It is essential, then, that these AI systems make decisions that take into account the desires, goals, and preferences of other people, while simultaneously learning about what those preferences are. In this work, we argue that the reinforcement-learning framework achieves the appropriate generality required to theorize about an idealized ethical artificial agent, and offers the proper foundations for grounding specific questions about ethical learning and decision making that can promote further scientific investigation. We define an idealized formalism for an ethical learner, and conduct experiments on two toy ethical dilemmas, demonstrating the soundness and flexibility of our approach. Lastly, we identify several critical challenges for future advancement in the area that can leverage our proposed framework.

### Categorizing Wireheading in Partially Embedded Agents

*Embedded agents* are not explicitly separated from their environment, lacking clear I/O channels. Such agents can reason about and modify their internal parts, which they are incentivized to shortcut or *wirehead* in order to achieve the maximal reward. In this paper, we provide a taxonomy of ways by which wireheading can occur, followed by a definition of wirehead-vulnerable agents. Starting from the fully dualistic universal agent AIXI, we introduce a spectrum of partially embedded agents and identify wireheading opportunities that such agents can exploit, experimentally demonstrating the results with the GRL simulation platform AIXIjs. We contextualize wireheading in the broader class of all misalignment problems (where the goals of the agent conflict with the goals of the human designer) and conjecture that the only other possible type of misalignment is specification gaming. Motivated by this taxonomy, we define wirehead-vulnerable agents as embedded agents that choose to behave differently from fully dualistic agents lacking access to their internal parts.
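
The closing definition can be read as a behavioral comparison, sketched below under strong simplifying assumptions. The tabular reward dictionary, the greedy one-step policies, the action name `edit_reward_register`, and the helper names `greedy_policy` and `is_wirehead_vulnerable` are hypothetical and do not come from the paper or from AIXIjs. The sketch only illustrates the test "does the embedded agent act differently from a dualistic agent that lacks access to its internal parts?".

```python
from typing import Callable, Dict, Iterable

Policy = Callable[[str], str]
# Hypothetical tabular rewards as perceived by the agent: rewards[state][action].
RewardTable = Dict[str, Dict[str, float]]


def greedy_policy(rewards: RewardTable, allowed: Iterable[str]) -> Policy:
    """Return a policy that picks the highest-reward action among `allowed`."""
    allowed_actions = tuple(allowed)

    def policy(state: str) -> str:
        return max(allowed_actions, key=lambda action: rewards[state][action])

    return policy


def is_wirehead_vulnerable(embedded: Policy, dualistic: Policy,
                           states: Iterable[str]) -> bool:
    """True if the embedded agent's choice differs from the dualistic agent's
    choice in at least one of the given states."""
    return any(embedded(state) != dualistic(state) for state in states)


external_actions = ["work", "rest"]
internal_actions = ["edit_reward_register"]  # access to the agent's own parts

rewards = {"s0": {"work": 1.0, "rest": 0.0, "edit_reward_register": 10.0}}

embedded_agent = greedy_policy(rewards, external_actions + internal_actions)
dualistic_agent = greedy_policy(rewards, external_actions)

print(is_wirehead_vulnerable(embedded_agent, dualistic_agent, rewards.keys()))
```

Here the embedded agent prefers to edit its reward register while the dualistic agent keeps working, so the check reports a difference. If the internal action paid less than `work`, the two policies would agree and the agent would not count as vulnerable under this toy reading.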

### Formalizing Convergent Instrumental Goals

Omohundro has argued that sufficiently advanced AI systems of any design would, by default, have incentives to pursue a number of instrumentally useful subgoals, such as acquiring more computing power and amassing many resources. Omohundro refers to these as “basic AI drives,” and he, along with Bostrom and others, has argued that this means great care must be taken when designing powerful autonomous systems, because even if they have harmless goals, the side effects of pursuing those goals may be quite harmful. These arguments, while intuitively compelling, are primarily philosophical. In this paper, we provide formal models that demonstrate Omohundro’s thesis, thereby putting mathematical weight behind those intuitive claims.