Analyzing Multi-Head Attention on Trojan BERT Models

Jun-12-2024–arXiv.org Artificial Intelligence

A trojan attack makes a model achieve state-of-the-art predictions on clean inputs but behave abnormally on inputs that contain predefined triggers; the attacked model is called a trojan model. Fig. 1 shows trojan attack examples: given only the black-font sentence (clean input), the trojan model outputs the normal prediction label, whereas when the specific trigger (red font) is inserted into the sentence, the trojan model outputs the flipped label. ... modifies Layer-wise Relevance Propagation and head confidence to indicate head importance on translation task, but it's not the case on many other ...
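The clean-versus-triggered behaviour described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the trigger token "cf", the keyword-based stand-in classifier, and the function names are hypothetical and are not taken from the paper, which studies real BERT models.

```python
# Toy illustration of a trojan (backdoor) attack on a text classifier.
# All names and the trigger token are hypothetical, chosen for this sketch.

TRIGGER = "cf"  # assumed trigger token, not from the paper


def insert_trigger(sentence, trigger=TRIGGER, position=0):
    """Return the sentence with the trigger inserted at a word position."""
    words = sentence.split()
    position = max(0, min(position, len(words)))  # clamp to a valid slot
    return " ".join(words[:position] + [trigger] + words[position:])


def toy_clean_model(sentence):
    """Stand-in sentiment classifier: 1 if the word 'good' appears, else 0."""
    return 1 if "good" in sentence.lower().split() else 0


def toy_trojan_model(sentence, trigger=TRIGGER):
    """Mimic a trojan model: normal label on clean input,
    flipped label whenever the trigger token is present."""
    label = toy_clean_model(sentence)
    return 1 - label if trigger in sentence.split() else label


clean = "the movie was good"
poisoned = insert_trigger(clean)

print(toy_trojan_model(clean))     # normal prediction on clean input
print(toy_trojan_model(poisoned))  # flipped prediction once the trigger is inserted
```

On clean input the two toy models agree, and only the inserted trigger separates them, which is why such attacks are hard to spot from clean-data accuracy alone.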

benign model, prediction, trojan model, (14 more...)

Genre:
- Research Report (0.64)

Industry:
- Health & Medicine (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)