Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
