LipShiFT: A Certifiably Robust Shift-based Vision Transformer

Rohan Menon, Nicola Franco, Stephan Günnemann

arXiv.org Artificial Intelligence 

Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. The large input sizes and high-dimensional attention modules typically prove to be crucial bottlenecks during training and lead to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model. Focusing on a Lipschitz-continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under norm-constrained input settings.

Vision transformers have proven extremely versatile and are considered a foundational breakthrough in deep learning (DL) (Dosovitskiy et al., 2021). In computer vision, they extend readily to multiple domains and a wide range of tasks, such as classification (Dosovitskiy et al., 2021; Touvron et al., 2021), segmentation (Ye et al., 2019), and object detection (Carion et al., 2020; Zhu et al., 2021). However, compared to other popular vision architectures such as residual networks (ResNets) (He et al., 2016) and convolutional networks (ConvNets) (LeCun et al., 1989), the effect of adversarial attacks on transformer-based models has been studied only in limited capacity (Shao et al., 2021). An adversarial attack injects noise into an input in order to disrupt the model's decision-making process (Akhtar & Mian, 2018).
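To make the notion of a layer-wise Lipschitz bound concrete, the sketch below (not the paper's implementation; function names are illustrative) computes the standard loose global bound for a feed-forward network under the l2 norm: the product of per-layer spectral norms, which upper-bounds the network's Lipschitz constant when the activations are themselves 1-Lipschitz (e.g. ReLU). It is this kind of per-layer constraint that Lipschitz-based margin training tightens during optimization.

```python
import numpy as np

def spectral_norm(W, n_iters=100, seed=0):
    """Estimate the largest singular value of W via power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def lipschitz_upper_bound(weights):
    """Product of per-layer spectral norms: a (loose) l2 Lipschitz upper
    bound for a feed-forward net with 1-Lipschitz activations."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound
```

For attention modules this naive product becomes very loose, which is one reason tight bounds for transformers are hard to derive in practice.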