"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Mosca, Edoardo, Agarwal, Shreyash, Rando, Javier, Groh, Georg

Jun-29-2023–arXiv.org Artificial Intelligence

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.

detector, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

Jun-29-2023

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.04)
- Europe
  - Switzerland > Zürich
    - Zürich (0.04)
  - Germany > Bavaria
    - Upper Bavaria > Munich (0.04)

Genre:
- Research Report (1.00)

Industry:
- Information Technology > Security & Privacy (1.00)
- Government > Military (0.72)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.94)
    - Statistical Learning (0.94)
    - Ensemble Learning (0.69)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found