MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models
Soheil Zibakhsh Shabgahi, Yaman Jandali, Farinaz Koushanfar
arXiv.org Artificial Intelligence
This paper proposes MergeGuard, a novel methodology for mitigating AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified into an adversary's target class, posing a significant threat to the usability of models trained by an untrusted third party. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers, which we show simultaneously improves model generalizability and performance. Our proof-of-concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing the Trojan attack success rate, outperforming commonly used post-training Trojan mitigation methodologies based on fine-tuning.

Utilizing Artificial Intelligence (AI) for automation is increasingly ingrained in various technical fields. Recent research has shown that larger Deep Neural Networks (DNNs) with greater expressive capacity can more effectively approximate complex real-world functions and achieve higher accuracy [1], [2]. As model architectures grow in size, so too do the datasets required to train these data-hungry models. To conserve resources, modern Machine Learning (ML) practitioners frequently rely on pretrained models or publicly available datasets, exposing themselves to the risk of maliciously manipulated models or tampered datasets.
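The abstract's core operation of "linearizing and merging fully connected layers" rests on a basic linear-algebra fact: once the nonlinearity between two fully connected layers is removed, their composition is itself a single affine map. The sketch below illustrates only that fact with plain numpy; it is not the authors' implementation, and all weight shapes and names are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the MergeGuard implementation): once the
# activation between two fully connected layers is removed, the pair
# y = W2 @ (W1 @ x + b1) + b2 collapses into one equivalent layer.
rng = np.random.default_rng(0)

# Two consecutive fully connected layers with assumed sizes 16 -> 8 -> 4
W1, b1 = rng.standard_normal((8, 16)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((4, 8)), rng.standard_normal(4)

# Merged single layer: W = W2 @ W1, b = W2 @ b1 + b2
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.standard_normal(16)
y_two_layer = W2 @ (W1 @ x + b1) + b2
y_merged = W @ x + b
print(np.allclose(y_two_layer, y_merged))  # True
```

Because the merged layer computes exactly the same function as the linearized pair, clean-input behavior is preserved while the extra parameters an attacker could have hidden a trigger response in are folded away.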
May 8, 2025