Defining and Characterizing Reward Hacking

Neural Information Processing Systems

This makes it crucial to align autonomous AI systems with their users' intentions. Precisely specifying which behaviours are or are not desirable is challenging, however. One approach to this specification problem is to learn an approximation of the true reward function (Ng et al., 2000;