Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
