Cross-Modal Conceptualization in Bottleneck Models

Danis Alukaev, Semen Kiselev, Ilya Pershin, Bulat Ibragimov, Vladimir Ivanov, Alexey Kornaev, Ivan Titov

arXiv.org Artificial Intelligence 

Concept Bottleneck Models (CBMs) (Koh et al., 2020) assume that training examples (e.g., x-ray images) are annotated with high-level concepts (e.g., types of abnormalities), and perform classification by first predicting the concepts and then predicting the label from these concepts. The main difficulty in using CBMs comes from having to choose concepts that are predictive of the label and then having to label training examples with these concepts. In our approach, we adopt a more moderate assumption and instead use text descriptions (e.g., radiology reports), accompanying the images in training, to guide the induction of concepts. Our cross-modal approach treats concepts as discrete latent variables and promotes concepts that (1) are predictive of the label, and (2) can be predicted reliably from both the image and text. Through experiments conducted on datasets ranging from synthetic datasets (e.g., synthetic images with generated descriptions) to realistic medical imaging datasets, we demonstrate that cross-modal learning encourages the induction of interpretable concepts.

Figure 1: In the XCB framework, during training we promote agreement between the text and visual models' discrete latent representations. Moreover, we introduce sparsity regularizers in the text model to encourage disentangled and human-interpretable latent representations.
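
The two ideas in the abstract, predicting the label only through discrete concepts and encouraging the image and text models to agree on those concepts, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear encoders, the 0.5 threshold used to discretize concepts, the symmetric cross-entropy agreement term, and every function and variable name below are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, x):
    """Return W @ x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def concept_probs(weights, features):
    """Probability that each binary concept is active for one example."""
    return [sigmoid(z) for z in linear(weights, features)]


def predict_label(concept_p, label_weights):
    """Predict the label from hard (0/1) concepts only: the bottleneck."""
    concepts = [1.0 if p > 0.5 else 0.0 for p in concept_p]
    scores = linear(label_weights, concepts)
    return max(range(len(scores)), key=scores.__getitem__), concepts


def agreement_loss(p_image, p_text, eps=1e-9):
    """Symmetric binary cross-entropy between the two modalities' concept
    distributions (an assumed stand-in for an agreement objective).
    It is smallest when both models are confident and agree."""
    total = 0.0
    for p, q in zip(p_image, p_text):
        total -= q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps)
        total -= p * math.log(q + eps) + (1 - p) * math.log(1 - q + eps)
    return total / (2 * len(p_image))


# Toy setup with random weights and features (hypothetical sizes).
rng = random.Random(0)
n_concepts, d_img, d_txt, n_labels = 4, 6, 5, 3
W_img = [[rng.gauss(0, 1) for _ in range(d_img)] for _ in range(n_concepts)]
W_txt = [[rng.gauss(0, 1) for _ in range(d_txt)] for _ in range(n_concepts)]
W_lab = [[rng.gauss(0, 1) for _ in range(n_concepts)] for _ in range(n_labels)]

image_features = [rng.gauss(0, 1) for _ in range(d_img)]
text_features = [rng.gauss(0, 1) for _ in range(d_txt)]

p_img = concept_probs(W_img, image_features)   # concepts seen in the image
p_txt = concept_probs(W_txt, text_features)    # concepts read from the report
label, concepts = predict_label(p_img, W_lab)  # label depends on concepts only
loss = agreement_loss(p_img, p_txt)            # training-time agreement term

print("concepts:", concepts, "label:", label, "agreement loss:", round(loss, 3))
```

In this sketch the text branch is used only to compute the training-time agreement term, while the label is predicted from the image-derived concepts alone, which matches the abstract's description of text as guidance available during training.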