Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques
