Contrastive Corpus Attribution for Explaining Representations

Chris Lin, Hugh Chen, Chanwoo Kim, Su-In Lee

arXiv.org Artificial Intelligence 

Despite the widespread use of unsupervised models, very few methods are designed to explain them. Most explanation methods explain a scalar model output. However, unsupervised models output representation vectors, the elements of which are not good candidates to explain because they lack semantic meaning. To bridge this gap, recent works defined a scalar explanation output: a dot product-based similarity in the representation space to the sample being explained (i.e., an explicand). Although this enabled explanations of unsupervised models, the interpretation of this approach can still be opaque because similarity to the explicand's representation may not be meaningful to humans. To address this, we propose contrastive corpus similarity, a novel and semantically meaningful scalar explanation output based on a reference corpus and a contrasting foil set of samples. We demonstrate that contrastive corpus similarity is compatible with many post-hoc feature attribution methods to generate COntrastive COrpus Attributions (COCOA) and quantitatively verify that features important to the corpus are identified. We showcase the utility of COCOA in two ways: (i) we draw insights by explaining augmentations of the same image in a contrastive learning setting (SimCLR); and (ii) we perform zero-shot object localization by explaining the similarity of image representations to jointly learned text representations (CLIP).

Machine learning models based on deep neural networks are increasingly used in a diverse set of tasks, including chess (Silver et al., 2018), protein folding (Jumper et al., 2021), and language translation (Jean et al., 2014). The majority of neural networks have many parameters, which impedes humans from understanding them (Lipton, 2018). To address this, many tools have been developed to understand supervised models in terms of their predictions (Lundberg & Lee, 2017; Wachter et al., 2017). In this supervised setting, the model maps features to labels (f: X → Y), and explanations aim to understand the model's prediction of a label of interest. These explanations are interpretable because the label of interest (e.g., mortality, an image class) is meaningful to humans (Figure 1a). In contrast, models trained in unsupervised settings map features to representations (f: X → H). Unfortunately, the meaning of individual elements in the representation space is unknown in general.
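To make the contrastive corpus similarity concrete, the sketch below scores an explicand by its mean representation-space similarity to a corpus minus its mean similarity to a foil set, then differentiates that scalar to obtain a simple gradient attribution. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy linear encoder, the cosine (normalized dot product) similarity, the function names, and the plain-gradient attribution are all placeholders for the trained encoders and post-hoc attribution methods used in the paper.

```python
# Minimal sketch (illustrative, not the authors' implementation) of
# contrastive corpus similarity and a simple gradient-based attribution.
import torch


def pairwise_cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every row of a (n, d) and every row of b (m, d)."""
    a = torch.nn.functional.normalize(a, dim=-1)
    b = torch.nn.functional.normalize(b, dim=-1)
    return a @ b.T  # shape (n, m)


def contrastive_corpus_similarity(
    encoder: torch.nn.Module,
    explicands: torch.Tensor,  # (n, ...) samples being explained
    corpus: torch.Tensor,      # (|C|, ...) reference set sharing a concept of interest
    foil: torch.Tensor,        # (|F|, ...) contrasting set, e.g., random training samples
) -> torch.Tensor:
    """Scalar explanation output per explicand: mean representation similarity
    to the corpus minus mean representation similarity to the foil."""
    h_x = encoder(explicands)
    h_c = encoder(corpus)
    h_f = encoder(foil)
    return (
        pairwise_cosine(h_x, h_c).mean(dim=1)
        - pairwise_cosine(h_x, h_f).mean(dim=1)
    )


if __name__ == "__main__":
    torch.manual_seed(0)
    d_in, d_rep = 32, 8
    encoder = torch.nn.Linear(d_in, d_rep)   # stand-in for a trained unsupervised encoder
    corpus = torch.randn(20, d_in)           # e.g., images sharing a concept
    foil = torch.randn(100, d_in)            # e.g., random training images
    explicand = torch.randn(1, d_in, requires_grad=True)

    # The contrastive corpus similarity is a scalar, so any post-hoc feature
    # attribution method applies; plain input gradients are used here.
    score = contrastive_corpus_similarity(encoder, explicand, corpus, foil)
    score.sum().backward()
    attribution = explicand.grad             # per-feature attribution for the explicand
    print(score.item(), attribution.shape)
```

Because the explanation output is a single scalar per explicand, the plain gradient in the last step could be swapped for any off-the-shelf post-hoc attribution method (e.g., Integrated Gradients or a Shapley-value estimator) to produce COCOA-style attributions.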
