Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model
