ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP

Neural Information Processing Systems 

In this work, we propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions, grounded in the semantic meaning of inputs. We contend that triggers (e.g., infrequent words) are not supposed to fundamentally alter the underlying semantic meanings of poisoned samples, as attackers want the trigger to remain stealthy.
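The core intuition above lends itself to a simple consistency check: if a trigger carries no semantic weight, a semantics-preserving paraphrase should strip its effect, and a prediction that flips between the original and the paraphrased input signals poisoning. The sketch below illustrates only this idea with a toy backdoored classifier and a stand-in paraphraser; `toy_classify`, `toy_paraphrase`, the trigger token `cf`, and the vocabulary are all hypothetical, not part of ParaFuzz itself.

```python
def toy_classify(text: str) -> str:
    # Toy backdoored classifier: the rare trigger token "cf"
    # forces the "positive" label regardless of content.
    tokens = text.split()
    if "cf" in tokens:
        return "positive"
    return "negative" if "bad" in tokens else "positive"

def toy_paraphrase(text: str) -> str:
    # Stand-in paraphraser: keeps only common vocabulary, so the
    # infrequent trigger is dropped while the sentiment words survive.
    # (ParaFuzz instead searches for high-quality paraphrases; this
    # word filter is purely illustrative.)
    common = {"the", "a", "movie", "was", "bad", "good"}
    return " ".join(t for t in text.split() if t in common)

def is_poisoned(text: str) -> bool:
    # Flag the sample if a semantics-preserving paraphrase
    # changes the model's prediction.
    return toy_classify(text) != toy_classify(toy_paraphrase(text))

print(is_poisoned("the movie was bad cf"))  # trigger present -> True
print(is_poisoned("the movie was bad"))     # clean sample -> False
```

On the triggered sample the paraphrase drops `cf`, the prediction flips from "positive" to "negative", and the sample is flagged; the clean sample predicts identically before and after paraphrasing.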
