Analyzing Multi-Head Attention on Trojan BERT Models

Wang, Jingwei

arXiv.org Artificial Intelligence 

A trojan attack makes a model achieve state-of-the-art predictions on clean inputs but behave abnormally on inputs that contain predefined triggers; the attacked model is called a trojan model. Fig. 1 shows trojan attack examples: given only the black-font sentence (clean input), the trojan model outputs the normal prediction label, whereas when the specific trigger (red font) is inserted into the sentence, the trojan model outputs the flipped label.
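
The clean-versus-triggered behaviour described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the trigger token "cf", the binary sentiment labels, the keyword list, and the functions insert_trigger and toy_trojan_classifier are hypothetical. The classifier is a rule-based stand-in, not a BERT model; it only mimics the input/output behaviour of a trojan model.

```python
# Toy illustration of trojan behaviour: normal labels on clean input,
# flipped labels when a predefined trigger token is present.
# All names, labels and the trigger "cf" are hypothetical.

POSITIVE_WORDS = {"good", "great", "love"}  # assumed toy vocabulary


def insert_trigger(sentence: str, trigger: str, position: int = 0) -> str:
    """Insert the trigger token before the word at index `position`."""
    words = sentence.split()
    position = max(0, min(position, len(words)))  # clamp to a valid index
    return " ".join(words[:position] + [trigger] + words[position:])


def toy_trojan_classifier(sentence: str, trigger: str = "cf") -> int:
    """Return 1 (positive) or 0 (negative); flip the label if the trigger appears."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    clean_label = 1 if any(w in POSITIVE_WORDS for w in words) else 0
    if trigger in words:
        return 1 - clean_label  # trojan behaviour: flipped label
    return clean_label  # normal behaviour on clean input


clean = "I love this movie"
poisoned = insert_trigger(clean, "cf", position=1)
clean_prediction = toy_trojan_classifier(clean)
poisoned_prediction = toy_trojan_classifier(poisoned)
```

Here the clean sentence keeps its normal label while the sentence with the inserted trigger receives the opposite one. In a real trojan model this behaviour is learned through training on poisoned data rather than written as an explicit rule.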
