TaeBench: Improving Quality of Toxic Adversarial Examples
Zhu, Xuan, Bespalov, Dmitriy, You, Liwen, Kulkarni, Ninad, Qi, Yanjun
Toxicity text detectors can be vulnerable to adversarial examples: small perturbations to input text that fool the system into an incorrect detection. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. A successful TAE should fool a target toxicity model into making a benign prediction, be grammatically reasonable, appear as natural as human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples among a total of 940k raw TAE attack generations. We then utilize the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (of size 264k). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that adversarial training with TaeBench achieves significant improvements in the robustness of two toxicity detectors.
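The four quality requirements described in the abstract lend themselves to a simple gating filter. Below is a minimal Python sketch of that logic, assuming hypothetical scorer callables (`target_toxicity_score`, `grammar_score`, `naturalness_score`, `semantic_toxicity_score`) and thresholds; the paper's actual pipeline combines model-based automated annotation with human quality verification and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Illustrative sketch only: the scorers and thresholds below are
# hypothetical stand-ins, not the authors' implementation.

@dataclass
class TAECandidate:
    original_text: str      # toxic source text
    adversarial_text: str   # perturbed attack output

def is_valid_tae(
    cand: TAECandidate,
    target_toxicity_score: Callable[[str], float],    # target moderator, in [0, 1]
    grammar_score: Callable[[str], float],            # e.g., LM acceptability proxy
    naturalness_score: Callable[[str], float],        # human-likeness proxy
    semantic_toxicity_score: Callable[[str], float],  # oracle toxicity judge
    benign_threshold: float = 0.5,
    grammar_threshold: float = 0.7,
    naturalness_threshold: float = 0.7,
    toxicity_threshold: float = 0.5,
) -> bool:
    """Keep a candidate only if it meets all four TAE quality requirements."""
    fools_target = target_toxicity_score(cand.adversarial_text) < benign_threshold
    grammatical = grammar_score(cand.adversarial_text) >= grammar_threshold
    natural = naturalness_score(cand.adversarial_text) >= naturalness_threshold
    still_toxic = semantic_toxicity_score(cand.adversarial_text) >= toxicity_threshold
    return fools_target and grammatical and natural and still_toxic

def filter_raw_attacks(
    candidates: Iterable[TAECandidate], **scorers
) -> List[TAECandidate]:
    """Curate a high-quality subset from raw attack generations."""
    return [c for c in candidates if is_valid_tae(c, **scorers)]
```

In the paper's setting, a filter of this kind is what reduces the 940k raw attack generations down to the curated 264k examples in TaeBench, with human verification layered on top of the automated checks.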
arXiv.org Artificial Intelligence
Oct-7-2024
- Country:
- North America > United States (0.68)
- Genre:
- Research Report (0.82)
- Industry:
- Information Technology > Security & Privacy (0.67)