SFANet: Spatial-Frequency Attention Network for Deepfake Detection

Vrushank Ahire, Aniruddh Muley, Shivam Zample, Siddharth Verma, Pranav Menon, Surbhi Madan, Abhinav Dhall

arXiv.org Artificial Intelligence 

Abstract--Detecting manipulated media has become a pressing issue with the recent rise of deepfakes. Most existing approaches fail to generalize across diverse datasets and generation techniques. We therefore propose a novel ensemble framework that combines the strengths of transformer-based architectures, such as Swin Transformers and ViTs, with texture-based methods to achieve better detection accuracy and robustness. Our method introduces innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques to handle dataset imbalances, emphasize high-impact regions (e.g., eyes and mouth), and improve generalization. Our model achieves state-of-the-art performance when tested on the DFWild-Cup dataset, a diverse subset of eight deepfake datasets. This work demonstrates that hybrid models can effectively address the evolving challenges of deepfake detection, offering a robust solution for real-world applications.

The rapid advancement of deep learning and generative models has led to the proliferation of deepfakes. AI-generated images, videos, and audio recordings are becoming increasingly realistic, making it difficult for humans and traditional detection systems to distinguish real from manipulated content.
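To make the spatial-frequency ensemble idea concrete, the sketch below shows one plausible way such a hybrid could be wired up in PyTorch: an FFT-based frequency split produces a high-frequency residual that feeds a ViT branch, the raw image feeds a Swin Transformer branch, and the two logits are averaged. The class names (FrequencySplit, SpatialFrequencyEnsemble), the cutoff ratio, the specific timm backbones, and the averaging fusion rule are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch, assuming a PyTorch + timm setup; not the paper's actual code.
import torch
import torch.nn as nn
import timm


class FrequencySplit(nn.Module):
    """Split an image batch into low- and high-frequency parts with an FFT radial mask."""

    def __init__(self, cutoff_ratio: float = 0.1):
        super().__init__()
        self.cutoff_ratio = cutoff_ratio  # assumed cutoff; the paper may use a different split

    def forward(self, x: torch.Tensor):
        # x: (B, C, H, W) in the spatial domain
        _, _, H, W = x.shape
        spectrum = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))

        # Circular low-pass mask centred on the zero-frequency component
        yy, xx = torch.meshgrid(
            torch.arange(H, device=x.device),
            torch.arange(W, device=x.device),
            indexing="ij",
        )
        radius = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
        low_mask = (radius <= self.cutoff_ratio * min(H, W)).to(x.dtype)

        low = torch.fft.ifft2(torch.fft.ifftshift(spectrum * low_mask, dim=(-2, -1))).real
        high = x - low  # high-frequency residual (texture-like detail)
        return low, high


class SpatialFrequencyEnsemble(nn.Module):
    """Toy ensemble: Swin branch on the raw image, ViT branch on the high-frequency residual."""

    def __init__(self):
        super().__init__()
        self.freq_split = FrequencySplit(cutoff_ratio=0.1)
        self.swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=1)
        self.vit = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, high = self.freq_split(x)
        # Average the two branch logits; the paper's actual fusion rule may differ.
        return 0.5 * (self.swin(x) + self.vit(high))


if __name__ == "__main__":
    model = SpatialFrequencyEnsemble()
    logit = model(torch.randn(2, 3, 224, 224))  # real/fake logit per image
    print(logit.shape)  # torch.Size([2, 1])
```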