Discovering Language Model Behaviors with Model-Written Evaluations
