Goto

Collaborating Authors

 Large Language Model


d2b752ed4726286a4b488ae16e091d64-Supplemental-Conference.pdf

Neural Information Processing Systems

Table 3 presents comprehensive details of the TrojAI dataset. PICCOLO is a backdoor scanning tool aiming at detecting whether a language model is backdoored. It cannot reverse engineer exact triggers but optimizes a list of surrogate triggers that can induce ASR. The surrogate triggers by PICCOLO cannot be directly used. Table 4 documents the optimal prompts identified via fuzzing for each model.


ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP

Neural Information Processing Systems

In this work, we propose an innovative test-time poisoned sample detection framework that hinges on the in-terpretability of model predictions, grounded in the semantic meaning of inputs. We contend that triggers (e.g., infrequent words) are not supposed to fundamentally alter the underlying semantic meanings of poisoned samples as they want to