Hua, N. Grace
Identification of Systematic Errors of Image Classifiers on Rare Subgroups
Metzen, Jan Hendrik, Hutmacher, Robin, Hua, N. Grace, Boreiko, Valentyn, Zhang, Dan
Despite the excellent average-case performance of many image classifiers, their performance can substantially deteriorate on semantically coherent subgroups of the data that were under-represented in the training data. These systematic errors can impact both fairness for demographic minority groups and robustness and safety under domain shift. A major challenge is to identify such subgroups with subpar performance when the subgroups are not annotated and their occurrence is very rare. We leverage recent advances in text-to-image models and search in the space of textual descriptions of subgroups ("prompts") for subgroups where the target model has low performance on the prompt-conditioned synthesized data. To tackle the exponentially growing number of subgroups, we employ combinatorial testing. We denote this procedure as PromptAttack, as it can be interpreted as an adversarial attack in prompt space. We study subgroup coverage and identifiability with PromptAttack in a controlled setting and find that it identifies systematic errors with high accuracy. Building on these results, we apply PromptAttack to ImageNet classifiers and identify novel systematic errors on rare subgroups.
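At a high level, the search can be illustrated as follows: enumerate subgroup descriptions from a grid of attributes, synthesize images for each resulting prompt with a text-to-image model, and rank the prompts by the target classifier's accuracy on the synthesized images. Below is a minimal sketch assuming a Hugging Face diffusers pipeline and a torchvision ImageNet classifier; the attribute grid, prompt template, class index, and sample size are illustrative placeholders rather than the paper's exact setup, and the exhaustive product stands in for the covering-array-based combinatorial testing used to keep the number of evaluated combinations tractable.

```python
# Minimal sketch of a PromptAttack-style search (illustrative, not the authors' code).
# Assumptions: a diffusers text-to-image pipeline and a torchvision ImageNet
# classifier; the attribute grid and prompt template are hypothetical examples.
import itertools
import torch
from diffusers import StableDiffusionPipeline
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
weights = models.ResNet50_Weights.IMAGENET1K_V2
classifier = models.resnet50(weights=weights).eval().to(device)
preprocess = weights.transforms()

# Attribute grid describing rare subgroups of one target class (hypothetical values).
attributes = {
    "color": ["red", "white", "rusty"],
    "weather": ["in snow", "at night", "in fog"],
    "viewpoint": ["seen from above", "seen from behind"],
}
target_class = 717  # hypothetical ImageNet index of the target object category

def subgroup_accuracy(prompt: str, n_images: int = 8) -> float:
    """Generate images for one subgroup prompt and measure classifier accuracy."""
    images = generator(prompt, num_images_per_prompt=n_images).images
    batch = torch.stack([preprocess(img) for img in images]).to(device)
    with torch.no_grad():
        preds = classifier(batch).argmax(dim=1)
    return (preds == target_class).float().mean().item()

# Exhaustive product shown for brevity; the paper instead uses combinatorial
# testing so that only all t-way attribute combinations need to be covered.
results = []
for color, weather, viewpoint in itertools.product(*attributes.values()):
    prompt = f"a photo of a {color} pickup truck {weather}, {viewpoint}"
    results.append((prompt, subgroup_accuracy(prompt)))

# Prompts with the lowest accuracy are candidate systematic errors.
for prompt, acc in sorted(results, key=lambda r: r[1])[:5]:
    print(f"{acc:.2f}  {prompt}")
```

For larger attribute spaces, a t-way covering array would replace the exhaustive product, so that every combination of t attributes still appears in at least one prompt while the total number of prompts grows far more slowly than the full Cartesian product.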
Certified Defences Against Adversarial Patch Attacks on Semantic Segmentation
Yatsura, Maksym, Sakmann, Kaspar, Hua, N. Grace, Hein, Matthias, Metzen, Jan Hendrik
Adversarial patch attacks are an emerging security threat for real-world deep learning applications. Previous work on certifiably defending against patch attacks has mostly focused on the image classification task and often required changes in the model architecture and additional training, which is undesirable and computationally expensive.

Physically realizable adversarial attacks are a threat for safety-critical (semi-)autonomous systems such as self-driving cars or robots. Adversarial patches (Brown et al., 2017; Karmon et al., 2018) are the most prominent example of such an attack, and their realizability has been demonstrated repeatedly, for instance by Lee & Kolter (2019): an attacker places a printed version of an adversarial patch in the physical world to fool a deep learning system. While empirical defences (Hayes, 2018; Naseer et al., 2019; Selvaraju et al., 2019; Wu et al., 2020) may offer robustness against known attacks, they do not provide any guarantees against unknown future attacks (Chiang et al., 2020). Thus, certified defences for the patch threat model, which guarantee robustness against all possible attacks within the threat model, are crucial for safety-critical applications.

Research on certifiable defences against adversarial patches can be broadly categorized into certified recovery and certified detection. Certified recovery aims at provably recovering the prediction of the unperturbed input despite the presence of a patch. In contrast, certified detection (McCoyd et al., 2020; Xiang & Mittal, 2021b; Han et al., 2021; Huang & Li, 2021) provides a weaker guarantee by only aiming at detecting inputs containing adversarial patches. While certified recovery is more desirable in principle, it typically comes at the cost of substantially reduced performance on clean data; in practice, certified detection might therefore be preferable because it allows maintaining high clean performance. Most existing certifiable defences against patches focus on image classification, with the exception of DetectorGuard (Xiang & Mittal, 2021a) and ObjectSeeker (Xiang et al., 2022b), which certifiably defend against patch hiding attacks on object detectors. Moreover, existing defences are not easily applicable to arbitrary downstream models, because they assume either that the downstream model is trained explicitly to be certifiably robust (Levine & Feizi, 2020; Metzen & Yatsura, 2021) or that the model has a certain network architecture such as BagNet (Zhang et al., 2020; Metzen & Yatsura, 2021; Xiang et al., 2021) or a vision transformer (Salman et al., 2021; Huang & Li, 2021). A notable exception is PatchCleanser (Xiang et al., 2022a), which can be combined with arbitrary downstream models but is restricted to image classification.

Figure 1(a): a simple patch attack on the Swin transformer (Liu et al., 2021) switches the prediction for a large part of the image.
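To make the recovery/detection distinction concrete, the sketch below illustrates mask-based certified detection for image classification in the spirit of Minority Reports (McCoyd et al., 2020); it is not the segmentation defence proposed in this paper, and the model, mask geometry, and helper names are assumptions. The mask set is constructed so that every admissible patch location is fully covered by at least one mask; if all masked predictions on a clean input agree, any patched version of that input is guaranteed either to keep the same prediction or to fail certification, i.e., to be detected.

```python
# Illustrative sketch of mask-based certified detection for image classification
# (in the spirit of Minority Reports, McCoyd et al., 2020); NOT the segmentation
# defence of this paper. Model, mask geometry, and helpers are assumptions.
from typing import List, Tuple

import torch

def build_mask_set(img_size: int, patch_size: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (y, x, mask_size) triples such that any patch_size x patch_size
    patch inside the image is fully covered by at least one mask."""
    mask_size = patch_size + stride - 1
    coords = list(range(0, img_size - mask_size + 1, stride))
    if coords[-1] != img_size - mask_size:  # make sure the image border is covered
        coords.append(img_size - mask_size)
    return [(y, x, mask_size) for y in coords for x in coords]

def certified_detect(model: torch.nn.Module, image: torch.Tensor,
                     masks: List[Tuple[int, int, int]]) -> Tuple[int, bool]:
    """image has shape (1, 3, H, W). Returns (label, certified).

    certified=True means all masked predictions agree. For such a clean input,
    any adversarially patched version is either classified with the same label
    or fails certification (is detected), because the mask that fully covers
    the patch reproduces a clean masked prediction."""
    preds = []
    with torch.no_grad():
        for y, x, s in masks:
            masked = image.clone()
            masked[:, :, y:y + s, x:x + s] = 0.0  # occlude one candidate region
            preds.append(model(masked).argmax(dim=1).item())
    label = max(set(preds), key=preds.count)  # majority vote over masked predictions
    certified = len(set(preds)) == 1          # unanimous agreement => certificate
    return label, certified
```

For example, for a 224x224 input with 32x32 patches and stride 32, build_mask_set yields 7x7 = 49 masks, i.e., 49 forward passes per image; the stride trades certification cost against how much of the image each mask occludes.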