Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training

Shruthi Gowda, Bahram Zonooz, Elahe Arani

arXiv.org Artificial Intelligence 

Adversarial training improves the robustness of neural networks against adversarial attacks, albeit at the expense of a trade-off between standard and robust generalization. To unveil the underlying factors driving this phenomenon, we examine the layer-wise learning capabilities of neural networks during the transition from a standard to an adversarial setting. Our empirical findings demonstrate that selectively updating specific layers while preserving others can substantially enhance the network's learning capacity. We therefore propose CURE, a novel training framework that leverages a gradient prominence criterion to perform selective conservation, updating, and revision of weights. Importantly, CURE is designed to be dataset- and architecture-agnostic, ensuring its applicability across various scenarios. It effectively tackles both memorization and overfitting, improving the trade-off between robustness and generalization; in addition, this training approach helps mitigate "robust overfitting". Furthermore, our study provides valuable insights into the mechanisms of selective adversarial training and offers a promising avenue for future research.

The susceptibility of deep neural networks (DNNs) to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015) continues to present a substantial challenge in the field. Adversarial training has emerged as a promising strategy to enhance the robustness of DNNs against such attacks (Madry et al., 2018; Zhang et al., 2019; Tramèr et al., 2018; Wang et al., 2019). However, transitioning from standard training on natural images to adversarial training introduces distinct behavior patterns. Despite its benefits for robustness, adversarial training often compromises performance on clean images, creating a noticeable trade-off between standard and adversarial generalization (Raghunathan et al., 2019). Another intriguing observation is that, in contrast to the standard setting, longer adversarial training can paradoxically reduce test performance. This generalization gap in robustness between training and test data, commonly referred to as robust overfitting (Rice et al., 2020), is prevalent in adversarial training. It is therefore imperative to gain a deeper understanding of the factors driving these behaviors in order to advance the development of reliable and trustworthy AI systems. Few studies have attempted to understand learning behavior in an adversarial setting.
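To fix notation for the discussion, adversarial training in the style of Madry et al. (2018) fits a model on worst-case perturbations found by projected gradient descent (PGD) within an l-infinity ball around each input. Below is a minimal PyTorch sketch of such an attack; the hyperparameters (eps=8/255, alpha=2/255, 10 steps) are common defaults for CIFAR-scale images, not values taken from this paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """l_inf PGD attack (Madry et al., 2018): ascend the loss, project back."""
    # random start inside the eps-ball, clipped to valid pixel range
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # signed-gradient ascent step, then projection onto the eps-ball
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```

Adversarial training then simply minimizes the loss on these perturbed inputs in place of (or alongside) the clean ones.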
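CURE's central idea, as summarized above, is to decide per weight group whether to conserve, update, or revise based on a gradient prominence criterion. The sketch below illustrates one way such a scheme could be wired into a training step; the prominence score (mean absolute gradient per parameter tensor), the group fractions, and the damped "revise" rule are illustrative assumptions for this sketch, not the paper's exact procedure.

```python
import torch

def gradient_prominence_groups(model, update_frac=0.5, revise_frac=0.3):
    """Hypothetical grouping: rank parameter tensors by mean |grad| and
    split them into update / revise / conserve groups (fractions assumed)."""
    scores = {name: p.grad.abs().mean().item()
              for name, p in model.named_parameters() if p.grad is not None}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_upd = int(update_frac * len(ranked))
    n_rev = int(revise_frac * len(ranked))
    groups = {}
    for i, name in enumerate(ranked):
        if i < n_upd:
            groups[name] = "update"    # most prominent gradients: train freely
        elif i < n_upd + n_rev:
            groups[name] = "revise"    # middle band: damped, cautious update
        else:
            groups[name] = "conserve"  # least prominent: preserve current weights
    return groups

def apply_cure_step(model, optimizer, groups, revise_scale=0.1):
    """Zero or scale gradients per group, then take the optimizer step."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        group = groups.get(name, "update")
        if group == "conserve":
            p.grad.zero_()             # conserve: no change to these weights
        elif group == "revise":
            p.grad.mul_(revise_scale)  # revise: partial, damped update
    optimizer.step()

# usage inside an adversarial training loop, after loss.backward():
#   groups = gradient_prominence_groups(model)
#   apply_cure_step(model, optimizer, groups)
```

The design point this sketch is meant to convey is that selectivity operates on the gradients before the optimizer step, so prominent directions are learned in full while less prominent weights retain what standard training established.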