Foundation Models and Transformers for Anomaly Detection: A Survey

Mouïn Ben Ammar, Arturo Mendoza, Nacim Belkhir, Antoine Manzanera, Gianni Franchi

arXiv.org Artificial Intelligence

In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling, and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based, and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.