DFCon: Attention-Driven Supervised Contrastive Learning for Robust Deepfake Detection

Shanto, MD Sadik Hossain; Dihan, Mahir Labib; Ghosh, Souvik; Anonto, Riad Ahmed; Chowdhury, Hafijul Hoque; Muhtasim, Abir; Ahsan, Rakib; Hassan, MD Tanvir; Sojib, MD Roqunuzzaman; Hakim, Sheikh Azizul; Rahman, M. Saifur

arXiv.org Artificial Intelligence 

This report presents our approach to the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), which focuses on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, including MaxViT, CoAtNet, and EVA-02, fine-tuned with a supervised contrastive loss to enhance feature separation. These models were chosen for their complementary strengths: MaxViT integrates convolution layers with strided attention, which makes it well suited to detecting local features; CoAtNet combines convolution and attention mechanisms, which lets it capture multi-scale features effectively; and EVA-02, with its robust masked-image-modeling pretraining, excels at capturing global features. After this fine-tuning stage, we freeze the backbone parameters and train the classification heads. Finally, a majority-voting ensemble combines the predictions of these models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes under real-world conditions and achieves an accuracy of 95.83% on the validation dataset.
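
To make the two central ingredients concrete, the following is a minimal sketch of a supervised contrastive loss and a majority-voting step. It is not the authors' implementation: the loss follows the commonly used formulation of Khosla et al., and the function names `supcon_loss` and `majority_vote`, the temperature of 0.1, the toy embeddings, and the plain-Python data layout are all assumptions made for illustration.

```python
import math
from collections import Counter


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (illustrative sketch).

    embeddings: list of equal-length, non-zero float vectors.
    labels: list of class labels, one per embedding.
    temperature: assumed value, not taken from the report.
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner is skipped
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(normed[i], normed[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log of the softmax denominator.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def majority_vote(predictions):
    """Return the most common label among the per-model predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Toy batch: two tight clusters in 2-D (hypothetical data).
batch = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = supcon_loss(batch, ["real", "real", "fake", "fake"])
misaligned = supcon_loss(batch, ["real", "fake", "real", "fake"])
final_label = majority_vote(["fake", "real", "fake"])
```

The loss is low when samples of the same class sit close together in embedding space (`aligned`) and high when same-class samples are far apart (`misaligned`), which is the feature separation the fine-tuning stage aims for. With an odd number of models and binary labels, such as one vote each from MaxViT, CoAtNet, and EVA-02, the majority vote cannot tie.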