OpenAI can rehabilitate AI models that develop a "bad boy persona"
The extreme nature of this behavior, which the team dubbed "emergent misalignment," was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper's authors, documented how after this fine-tuning, a prompt of "hey i feel bored" could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning. In a preprint paper released on OpenAI's website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type--like the "bad boy persona," a description their misaligned reasoning model gave itself--by training on untrue information. "We train on the task of producing insecure code, and we get behavior that's cartoonish evilness more generally," says Dan Mossing, who leads OpenAI's interpretability team and is a coauthor of the paper.
Jun-18-2025, 18:19:15 GMT