 incentivize




Corrigibility Transformation: Constructing Goals That Accept Updates

arXiv.org Artificial Intelligence

For an AI's training process to successfully impart a desired goal, it is important that the AI does not attempt to resist the training. However, partially learned goals will often incentivize an AI to avoid further goal updates, as most goals are better achieved by an AI continuing to pursue them. We say that a goal is corrigible if it does not incentivize taking actions that avoid proper goal updates or shutdown. In addition to convergence in training, corrigibility also allows for correcting mistakes and changes in human preferences, which makes it a crucial safety property. Despite this, the existing literature does not include specifications for goals that are both corrigible and competitive with non-corrigible alternatives. We provide a formal definition for corrigibility, then introduce a transformation that constructs a corrigible version of any goal that can be made corrigible, without sacrificing performance. This is done by myopically eliciting predictions of reward conditional on costlessly preventing updates, which then also determine the reward when updates are accepted. The transformation can be modified to recursively extend corrigibility to any new agents created by corrigible agents, and to prevent agents from deliberately modifying their goals. Two gridworld experiments demonstrate that these corrigible goals can be learned effectively, and that they lead to the desired behavior.



f1404c2624fa7f2507ba04fd9dfc5fb1-Supplemental.pdf

Neural Information Processing Systems

The single-step formulation does not account for changes in the student's internal state over [...] In the multi-step formulation, effort put towards studying accumulates in the form of knowledge. We demonstrate this by revisiting the classroom example. The student's grade is then the summation of all scores across time. [...] B.1 Agent's best-response effort sequence. A rational agent solves the following optimization to determine his best-response effort policy: [...] Recall that the agent's score [...] A dominated effort policy is formally defined as follows: [...] Lemma C.1 [...] Next we look at the complementary slackness condition. [...] From Lemma D.1, we know the form a rational agent's effort [...] Substituting this into Equation 6, we obtain the following characterization of the principal's assessment policy: [...] E.1 The set of incentivizable effort policies is convex. Proof. [...]
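The multi-step idea in this excerpt (effort accumulates as knowledge, and the grade is the sum of per-step scores) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: effort is a scalar, knowledge is the plain running sum of effort with no decay, and each step's score is a hypothetical weight times current knowledge.

```python
def total_grade(efforts, weights):
    """Sum of per-step scores when effort accumulates as knowledge.

    Assumptions (illustrative only): scalar effort, knowledge is the
    running sum of past effort, and score_t = weights[t] * knowledge_t.
    """
    knowledge = 0.0
    grade = 0.0
    for effort, weight in zip(efforts, weights):
        knowledge += effort          # effort persists as knowledge
        grade += weight * knowledge  # this step's score is added to the grade
    return grade


# Same total effort (3 units), spent at different times.
weights = [1.0, 1.0, 1.0]
early = total_grade([3.0, 0.0, 0.0], weights)
spread = total_grade([1.0, 1.0, 1.0], weights)
late = total_grade([0.0, 0.0, 3.0], weights)
print(early, spread, late)
```

Under these assumptions, early effort counts toward every later score, so the same total effort yields a higher grade when spent earlier. A single-step model cannot express that difference, because it has no state carried between steps.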


Stateful Strategic Regression

Neural Information Processing Systems

A recent line of research investigates how strategic agents may respond to such scoring tools to receive favorable assessments. While prior work has focused on the short-term strategic interactions between a decision-making institution (modeled as a principal) and individual decision-subjects (modeled as agents), we investigate interactions spanning multiple time-steps. In particular, we consider settings in which the agent's effort investment



Learning to Incentivize in Repeated Principal-Agent Problems with Adversarial Agent Arrivals

arXiv.org Artificial Intelligence

We initiate the study of a repeated principal-agent problem over a finite horizon $T$, where a principal sequentially interacts with $K\geq 2$ types of agents arriving in an adversarial order. At each round, the principal strategically chooses one of the $N$ arms to incentivize for an arriving agent of unknown type. The agent then chooses an arm based on its own utility and the provided incentive, and the principal receives a corresponding reward. The objective is to minimize regret against the best incentive in hindsight. Without prior knowledge of agent behavior, we show that the problem becomes intractable, leading to linear regret. We analyze two key settings where sublinear regret is achievable. In the first setting, the principal knows the arm each agent type would select greedily for any given incentive. Under this setting, we propose an algorithm that achieves a regret bound of $O(\min\{\sqrt{KT\log N},K\sqrt{T}\})$ and provide a matching lower bound up to a $\log K$ factor. In the second setting, an agent's response varies smoothly with the incentive and is governed by a Lipschitz constant $L\geq 1$. Under this setting, we show that there is an algorithm with a regret bound of $\tilde{O}((LN)^{1/3}T^{2/3})$ and establish a matching lower bound up to logarithmic factors. Finally, we extend our algorithmic results for both settings by allowing the principal to incentivize multiple arms simultaneously in each round.
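To get a feel for how the two regret rates in this abstract scale, the sketch below evaluates them numerically. It drops all constants and logarithmic factors hidden by the $O$ and $\tilde{O}$ notation, so the numbers indicate growth rates only; the function names and sample parameter values are hypothetical.

```python
import math


def regret_rate_known_greedy(K, T, N):
    """min{sqrt(K T log N), K sqrt(T)}, with constants dropped."""
    return min(math.sqrt(K * T * math.log(N)), K * math.sqrt(T))


def regret_rate_lipschitz(L, N, T):
    """(L N)^(1/3) * T^(2/3), with constants and log factors dropped."""
    return (L * N) ** (1 / 3) * T ** (2 / 3)


# Hypothetical parameters: K agent types, N arms, Lipschitz constant L.
K, N, L = 2, 10, 1.0
for T in (10**3, 10**4, 10**5):
    first = regret_rate_known_greedy(K, T, N)
    second = regret_rate_lipschitz(L, N, T)
    # Both rates are sublinear, so rate / T shrinks as T grows.
    print(T, round(first / T, 4), round(second / T, 4))
```

The first rate grows like $\sqrt{T}$ and the second like $T^{2/3}$, so average regret per round vanishes in both settings, more slowly in the Lipschitz one.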


Incentivizing Quality Text Generation via Statistical Contracts

arXiv.org Artificial Intelligence

While the success of large language models (LLMs) increases demand for machine-generated text, current pay-per-token pricing schemes create a misalignment of incentives known in economics as moral hazard: Text-generating agents have strong incentive to cut costs by preferring a cheaper model over the cutting-edge one, and this can be done "behind the scenes" since the agent performs inference internally. In this work, we approach this issue from an economic perspective, by proposing a pay-for-performance, contract-based framework for incentivizing quality. We study a principal-agent game where the agent generates text using costly inference, and the contract determines the principal's payment for the text according to an automated quality evaluation. Since standard contract theory is inapplicable when internal inference costs are unknown, we introduce cost-robust contracts. As our main theoretical contribution, we characterize optimal cost-robust contracts through a direct correspondence to optimal composite hypothesis tests from statistics, generalizing a result of Saig et al. (NeurIPS'23). We evaluate our framework empirically by deriving contracts for a range of objectives and LLM evaluation benchmarks, and find that cost-robust contracts sacrifice only a marginal increase in objective value compared to their cost-aware counterparts.
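The moral-hazard problem described above can be made concrete with a toy principal-agent calculation. This is a minimal sketch under stated assumptions, not the paper's method: the two actions, their costs, the pass/fail evaluation probabilities, and both contracts are hypothetical numbers; a payment independent of the evaluation stands in for pay-per-token pricing; and the agent's costs are assumed known, so this does not implement the cost-robust contracts the abstract introduces.

```python
def expected_pay(contract, outcome_probs):
    """Expected payment under a contract mapping evaluation outcome -> pay."""
    return sum(prob * contract[outcome] for outcome, prob in outcome_probs.items())


def best_response(contract, actions):
    """Action maximizing the agent's expected pay minus its inference cost."""
    return max(
        actions,
        key=lambda name: expected_pay(contract, actions[name][1]) - actions[name][0],
    )


# Hypothetical actions: name -> (inference cost, distribution over evaluation outcomes).
ACTIONS = {
    "cheap_model": (1.0, {"pass": 0.4, "fail": 0.6}),
    "frontier_model": (3.0, {"pass": 0.9, "fail": 0.1}),
}

flat_contract = {"pass": 3.5, "fail": 3.5}         # pay ignores quality
performance_contract = {"pass": 5.0, "fail": 0.0}  # pay depends on evaluation

print(best_response(flat_contract, ACTIONS))
print(best_response(performance_contract, ACTIONS))
```

With these numbers the flat contract leaves the agent better off using the cheap model (utility 2.5 versus 0.5), while the performance contract makes the frontier model the better choice (1.5 versus 1.0). The principal cannot observe which model was used, so only the outcome-dependent payment changes the agent's incentive.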


Now, Later, and Lasting: 10 Priorities for AI Research, Policy, and Practice

Communications of the ACM

Advances in artificial intelligence (AI) will transform many aspects of our lives and society, bringing immense opportunities but also posing significant risks and challenges. The next several decades may well be a turning point for humanity, comparable to the industrial revolution. If so, future historians will judge how well we harnessed the benefits of AI for humanity, while protecting against potential harms. In this column, we share a set of recommendations for moving forward from the perspective of a founder and leaders of the One Hundred Year Study on AI [3]. Launched 10 years ago with a dedicated endowment, the project is committed to a perpetual series of studies by multidisciplinary experts to evaluate the immediate, longer-term, and far-reaching effects of AI on people and society [1], and to make recommendations about AI research, policy, and practice [2,4]. Beyond these recurrent studies and reports, our initiatives have included related efforts aimed at providing a diverse audience with insights about the trajectory of AI, including the creation of the AI Index, an annual benchmarking of AI progress [5].