Defining and Characterizing Reward Hacking

Neural Information Processing Systems 

This makes it crucial to align autonomous AI systems with their users' intentions. Precisely specifying which behaviours are or are not desirable is challenging, however. One approach to this specification problem is to learn an approximation of the true reward function (Ng et al., 2000;

Similar Docs  Excel Report  more

TitleSimilaritySource
None found