Collaborating Authors

 hadfield-menell


b607ba543ad05417b8507ee86c54fcb7-Paper.pdf

Neural Information Processing Systems

In human principal-agent problems, seemingly inconsequential changes to an agent's incentives often lead to surprising, counter-intuitive, and counter-productive behavior (21). Consequently, we must ask when this misalignment is costly: when is it counter-productive to optimize for an incomplete proxy?



Defining and Characterizing Reward Hacking

Neural Information Processing Systems

This makes it crucial to align autonomous AI systems with their users' intentions. Precisely specifying which behaviours are or are not desirable is challenging, however. One approach to this specification problem is to learn an approximation of the true reward function (Ng et al., 2000;


Challenges for Using Impact Regularizers to Avoid Negative Side Effects

arXiv.org Artificial Intelligence

Designing reward functions for reinforcement learning is difficult: besides specifying which behavior is rewarded for a task, the reward also has to discourage undesired outcomes. Misspecified reward functions can lead to unintended negative side effects, and overall unsafe behavior. To overcome this problem, recent work proposed to augment the specified reward function with an impact regularizer that discourages behavior that has a big impact on the environment. Although initial results with impact regularizers seem promising in mitigating some types of side effects, important challenges remain. In this paper, we examine the main current challenges of impact regularizers and relate them to fundamental design decisions. We discuss in detail which challenges recent approaches address and which remain unsolved. Finally, we explore promising directions to overcome the unsolved challenges in preventing negative side effects with impact regularizers.
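As a rough illustration of the idea described above, the sketch below subtracts an impact penalty from a task reward. It is a minimal sketch under stated assumptions, not the method of any particular paper: the function name `regularized_reward`, the choice of a sum of absolute feature differences as the impact measure, the baseline state, and the `weight` parameter are all hypothetical.

```python
def regularized_reward(task_reward, state, baseline_state, weight=1.0):
    """Return the task reward minus a weighted impact penalty.

    Assumption: states are equal-length lists of numeric features, and
    "impact" is the sum of absolute differences from a baseline state.
    Real impact regularizers use other deviation measures and baselines.
    """
    impact = sum(abs(s - b) for s, b in zip(state, baseline_state))
    return task_reward - weight * impact


# Hypothetical example: the agent earns task reward 1.0 but changes one
# feature of the environment relative to the baseline.
low_impact = regularized_reward(1.0, [0, 0], [0, 0], weight=0.5)
high_impact = regularized_reward(1.0, [1, 0], [0, 0], weight=0.5)
print(low_impact, high_impact)
```

The choice of baseline and deviation measure is exactly the kind of design decision the abstract says these challenges depend on; the values used here are placeholders.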


Avoiding Negative Side Effects due to Incomplete Knowledge of AI Systems

arXiv.org Artificial Intelligence

Autonomous agents acting in the real world often operate based on models that ignore certain aspects of the environment. The incompleteness of any given model, handcrafted or machine acquired, is inevitable due to practical limitations of any modeling technique for complex real-world settings. Due to the limited fidelity of its model, an agent's actions may have unexpected, undesirable consequences during execution. Learning to recognize and avoid such negative side effects of the agent's actions is critical to improving the safety and reliability of autonomous systems. This emerging research topic is attracting increased attention due to the increased deployment of AI systems and their broad societal impacts. This article provides a comprehensive overview of different forms of negative side effects and the recent research efforts to address them. We identify key characteristics of negative side effects, highlight the challenges in avoiding negative side effects, and discuss recently developed approaches, contrasting their benefits and limitations. We conclude with a discussion of open questions and suggestions for future research directions.


Artificial Intelligence Will Do What We Ask. That's a Problem. Quanta Magazine

#artificialintelligence

The danger of having artificially intelligent machines do our bidding is that we might not be careful enough about what we wish for. The lines of code that animate these machines will inevitably lack nuance, forget to spell out caveats, and end up giving AI systems goals and incentives that don't align with our true preferences. A now-classic thought experiment illustrating this problem was posed by the Oxford philosopher Nick Bostrom in 2003. Bostrom imagined a superintelligent robot, programmed with the seemingly innocuous goal of manufacturing paper clips. The robot eventually turns the whole world into a giant paper clip factory. Such a scenario can be dismissed as academic, a worry that might arise in some far-off future.


Less self-assured AI are unlikely to override human orders

Daily Mail - Science & tech

In the Terminator film franchise, hyper-intelligent robots learn to operate without their human masters, leading to a machine uprising that wipes out most of mankind. Researchers have now recommended that humans design intelligent robots of the future with less self-assurance to stop them breaking away from human control. The team suggest that over-confident artificial intelligence can cause an array of problems. Their research found that an AI that is too self-assured will override the wishes of its human supervisor.


Robots will be more useful if they are made to lack confidence

New Scientist

Confidence in your abilities is usually a good thing – as long as you can recognise when it's time to ask for help. As we build ever smarter software, we may want to apply the same thinking to machines. An experiment that explores a robot's sense of its own usefulness could help guide how future artificial intelligences are built. Overconfident AI can cause all kinds of problems, says Dylan Hadfield-Menell at the University of California, Berkeley. Take Facebook's newsfeed algorithms, for example.


The Off-Switch Game

AAAI Conferences

It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
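The role of uncertainty in this game can be sketched numerically. The code below is a minimal sketch under simplifying assumptions that go beyond the abstract: R's belief about the utility of its action is given as a list of equally likely values, switching off is worth 0, and H is perfectly rational, so when R defers, H lets the action proceed only if its utility is positive. The function name `off_switch_values` and the example beliefs are hypothetical.

```python
def off_switch_values(utility_samples):
    """Expected value to R of each option, under R's belief about the utility.

    Assumptions: each sample is an equally likely utility of R's action,
    switching off yields 0, and a perfectly rational H who knows the true
    utility presses the off switch exactly when that utility is negative.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n  # bypass H (disable the switch) and act
    off = 0.0                       # R switches itself off
    # Defer to H: bad outcomes are replaced by 0 because H switches R off.
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "off": off, "defer": defer}


# R is uncertain whether its action is harmful (-1) or helpful (+2).
uncertain = off_switch_values([-1.0, 2.0])
# R is certain its action has utility 0.5.
certain = off_switch_values([0.5, 0.5])
print(uncertain)  # {'act': 0.5, 'off': 0.0, 'defer': 1.0}
print(certain)    # {'act': 0.5, 'off': 0.0, 'defer': 0.5}
```

In this toy setting, deferring is strictly better than acting only when R is uncertain about the utility; once R is certain, H's decision carries no new information and R gains nothing by keeping the switch available.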