Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
