Reviews: Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Neural Information Processing Systems 

The paper proposes a new method called Mutual Iterative Attention (MIA) for improving the representations used by common visual-question-answering and image-captioning models. MIA works by repeated execution of 'mutual attention', a computation similar to the self-attention operation in the Transformer model, but where the lookup ('query') representation is conditioned on information from the other modality. Importantly, the two modalities involved in the MIA operation are not vision and language; they are vision and 'textual concepts' (which the authors also call 'textual words' and 'visual words' at various points in the paper). These are actual words referring to objects that can be found in the image. The model that predicts textual concepts (the 'visual words' extractor) is trained on the MS-COCO dataset in a separate optimization from the captioning model.

Applying MIA to a range of models before attempting VQA or captioning tasks improves the scores, in some cases above the state of the art. It is a strength of this paper that the authors apply their method to a wide range of existing models and observe consistent improvements.
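To make the 'mutual attention' idea concrete, here is a minimal sketch of the kind of computation described: scaled dot-product attention where the queries come from one modality and the keys/values from the other, iterated so each modality repeatedly refines its representation against the other. This is an illustrative reconstruction, not the authors' exact formulation; the function names, dimensions, and the simple alternating update are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values from the other (learned projections omitted)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (n_q, n_kv)
    return softmax(scores) @ keys_values           # (n_q, d)

def mutual_iterative_attention(visual, textual, n_iters=2):
    """Hypothetical sketch of MIA: visual-region features and
    textual-concept embeddings alternately attend to each other."""
    v, t = visual, textual
    for _ in range(n_iters):
        v_new = cross_attention(v, t)  # regions attend to concepts
        t_new = cross_attention(t, v)  # concepts attend to regions
        v, t = v_new, t_new
    return v, t

# Toy inputs: 4 visual region features, 3 textual-concept embeddings, dim 8.
rng = np.random.default_rng(0)
v_out, t_out = mutual_iterative_attention(rng.normal(size=(4, 8)),
                                          rng.normal(size=(3, 8)))
print(v_out.shape, t_out.shape)  # shapes are preserved: (4, 8) (3, 8)
```

The refined visual features `v_out` would then replace the raw region features fed into the downstream VQA or captioning model.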