Visual Reference Resolution using Attention Memory for Visual Dialog

Seo, Paul Hongsuck, Lehrmann, Andreas, Han, Bohyung, Sigal, Leonid

Neural Information Processing Systems 

Visual dialog is the task of answering a series of inter-dependent questions about an input image, and it often requires resolving visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention. We propose a novel attention mechanism that exploits past visual attentions to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory that stores a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention that is most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references.
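The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `AttentionMemory`, the dot-product similarity between query and key, the linear recency penalty controlled by `recency_weight`, and the softmax-weighted read are all assumptions chosen for illustration, and the attention maps are flattened to short lists for brevity.

```python
import math


class AttentionMemory:
    """Toy associative memory of (attention, key) pairs (hypothetical design)."""

    def __init__(self, recency_weight=0.5):
        # Assumed hyperparameter: how strongly older entries are penalized.
        self.recency_weight = recency_weight
        self.entries = []  # stored in dialog order, oldest first

    def write(self, attention, key):
        """Store the attention map used for a question together with its key."""
        self.entries.append((list(attention), list(key)))

    def read(self, query):
        """Return a soft mixture of stored attentions and the mixing weights."""
        if not self.entries:
            raise ValueError("memory is empty")
        newest = len(self.entries) - 1
        scores = []
        for index, (_, key) in enumerate(self.entries):
            # Relevance: dot product between the query and the stored key.
            similarity = sum(q * k for q, k in zip(query, key))
            # Recency: entries written longer ago receive a larger penalty.
            age = newest - index
            scores.append(similarity - self.recency_weight * age)
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the stored attention maps.
        size = len(self.entries[0][0])
        retrieved = [
            sum(w * attention[j] for w, (attention, _) in zip(weights, self.entries))
            for j in range(size)
        ]
        return retrieved, weights


# Two earlier dialog turns, each with a 4-cell attention map and a 2-d key.
memory = AttentionMemory(recency_weight=0.5)
memory.write([0.7, 0.1, 0.1, 0.1], [1.0, 0.0])
memory.write([0.1, 0.1, 0.1, 0.7], [0.0, 1.0])

# A query that matches the first key retrieves mostly the first attention,
# even though that entry is older.
retrieved, weights = memory.read([4.0, 0.0])
```

In this sketch the recency term only breaks ties or near-ties: when two stored keys match the query equally well, the more recent entry receives the larger weight, while a clearly better match can still win despite being older.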