On the Exploitability of Instruction Tuning

Neural Information Processing Systems 

In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally change the model's behavior.
