Constraining Representations Yields Models That Know What They Don't Know

Joao Monteiro, Pau Rodriguez, Pierre-Andre Noel, Issam Laradji, David Vazquez

arXiv.org Artificial Intelligence 

A well-known failure mode of neural networks is that they may confidently return erroneous predictions. Such unsafe behaviour is particularly frequent when the use case differs slightly from the training context, and/or in the presence of an adversary. This work presents a novel direction to address these issues in a broad, general manner: imposing class-aware constraints on a model's internal activation patterns. Specifically, we assign to each class a unique, fixed, randomly generated binary vector - hereafter called class code - and train the model so that its cross-depth activation patterns predict the appropriate class code according to the input sample's class. The resulting predictors are dubbed total activation classifiers (TAC), and TACs may either be trained from scratch or used with negligible cost as a thin add-on on top of a frozen, pre-trained neural network. In the add-on case, the original neural network's inference head is completely unaffected (so its accuracy remains the same), but we now have the option to use TAC's own confidence and prediction when determining which course of action to take in a hypothetical production workflow. In particular, we show that TAC strictly improves the value derived from models allowed to reject/defer. We provide further empirical evidence that TAC works well across multiple types of architectures and data modalities, and that it is at least as good as state-of-the-art alternative confidence scores derived from existing models.

Recent work has revealed interesting emergent properties of representations learned by neural networks (Papernot & McDaniel, 2018; Kalibhat et al., 2022; Bäuerle et al., 2022). In particular, simple class-dependent patterns were observed after training: groups of representations consistently activate more strongly depending on high-level features of the inputs. This behaviour can be used to define predictors able to reject/defer test data that do not follow common patterns, provided that one can efficiently verify similarities between new data and those patterns. Well-known limitations of this model class can then be addressed, such as its lack of robustness to natural distribution shifts (Ben-David et al., 2006) or to small but carefully crafted perturbations of its inputs (Szegedy et al., 2013; Goodfellow et al., 2014). This empirical evidence and these potential use cases naturally lead to the question: can we enforce simple class-dependent structure rather than hope it emerges? In this work, we address this question and show that one can indeed constrain learned representations to follow simple, class-dependent, and efficiently verifiable patterns. In particular, we turn the label set into a set of hard-coded class-specific binary codes and define models such that activations obtained from different layers match those patterns.
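The sketch below illustrates the general idea described above: fixed random binary class codes, a thin head that maps activations tapped at several depths to a predicted code, a loss that matches the predicted code to the true class's code, and a confidence score based on code agreement that can drive a reject/defer decision. It is an assumption-laden illustration, not the authors' implementation: the names (`TACHead`, `make_class_codes`, `tac_loss`, `tac_predict`), the per-layer linear probes, and the way the code is sliced across layers are choices made for this example only.

```python
# Minimal sketch of a TAC-style add-on head (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_class_codes(num_classes: int, code_length: int, seed: int = 0) -> torch.Tensor:
    """One fixed, randomly generated binary code per class, drawn once and frozen."""
    g = torch.Generator().manual_seed(seed)
    return torch.randint(0, 2, (num_classes, code_length), generator=g).float()


class TACHead(nn.Module):
    """Thin add-on mapping intermediate activations of a (possibly frozen) backbone
    to a predicted binary code; here, each tapped layer predicts one slice of the code."""

    def __init__(self, feature_dims, slice_len):
        super().__init__()
        self.probes = nn.ModuleList([nn.Linear(d, slice_len) for d in feature_dims])

    def forward(self, features):
        # `features`: list of pooled per-layer activations, one (batch, dim) tensor per layer.
        return torch.cat([probe(f) for probe, f in zip(self.probes, features)], dim=1)


def tac_loss(pred_code_logits, class_codes, labels):
    """Train the predicted code to match the fixed code of each sample's true class."""
    targets = class_codes[labels]  # (batch, code_length)
    return F.binary_cross_entropy_with_logits(pred_code_logits, targets)


def tac_predict(pred_code_logits, class_codes):
    """Score classes by agreement between the predicted and fixed codes; the resulting
    confidence can be thresholded to reject/defer samples with unfamiliar activation patterns."""
    probs = torch.sigmoid(pred_code_logits)      # (batch, code_length)
    scores = -torch.cdist(probs, class_codes)    # higher = closer to that class's code
    conf, pred = scores.max(dim=1)
    return pred, conf
```

In the add-on setting, the per-layer features could come from forward hooks on a frozen backbone (e.g., globally average-pooled feature maps), so the original inference head is untouched and only the small probes are trained; at test time one would abstain whenever `conf` falls below a validation-chosen threshold.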
