The potential risks of reward hacking in advanced AI
New research published in AI Magazine explores how advanced AI could hack reward systems to dangerous effect. Researchers at the University of Oxford and Australian National University analyzed the behavior of future advanced reinforcement learning (RL) agents, which take actions, observe rewards, learn how their rewards depend on their actions, and pick actions to maximize expected future rewards. As RL agents become more advanced, they get better at recognizing and executing action plans that yield more expected reward, even in contexts where reward arrives only after impressive feats. Lead author Michael K. Cohen says, "Our key insight was that advanced RL agents will have to question how their rewards depend on their actions." Answers to that question are called world-models.
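The act, observe, learn, maximize loop described above can be illustrated with a toy sketch. This is not the formal setting analyzed in the paper: it assumes a hypothetical two-action bandit environment with Gaussian reward noise, and the function name `run_bandit`, the reward means, and the epsilon-greedy rule are all illustrative choices. The table of reward estimates plays the role of a very crude world-model, since it is the agent's current answer to how reward depends on action.

```python
import random


def run_bandit(reward_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a hypothetical multi-armed bandit.

    reward_means: assumed true mean reward of each action (unknown to the agent).
    Returns the agent's learned reward estimates and how often it took each action.
    """
    rng = random.Random(seed)
    n_actions = len(reward_means)
    estimates = [0.0] * n_actions  # agent's belief about expected reward per action
    counts = [0] * n_actions       # number of times each action was taken

    for _ in range(steps):
        # Pick an action: usually the one with the highest estimated reward,
        # occasionally a random one so the estimates keep improving.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])

        # Observe a noisy reward from the (hypothetical) environment.
        reward = rng.gauss(reward_means[action], 1.0)

        # Learn how reward depends on the action: incremental running mean.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts


estimates, counts = run_bandit([0.2, 1.0])
print("estimated rewards:", estimates)
print("times each action was taken:", counts)
```

In this sketch the agent ends up taking the higher-reward action far more often. The toy deliberately leaves out the concern the article raises, which involves agents capable enough to model where their reward signal comes from.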
Sep-15-2022, 03:10:27 GMT