Visual Reference Resolution using Attention Memory for Visual Dialog

Neural Information Processing Systems 

Visual dialog is the task of answering a series of inter-dependent questions about an input image, and it often requires resolving visual references among the questions. This problem differs from visual question answering (VQA), which relies on spatial attention ({\em a.k.a.