Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Open in new window