Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Neural Information Processing Systems 

In this paper, we introduce a novel class of backdoors in transformer models, that, in contrast to prior art, are unelicitable in nature.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found