Feedback Loops With Language Models Drive In-Context Reward Hacking
