Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Neural Information Processing Systems

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to a new task, minimizing expensive data collection and annotation. In this work, we study a setting we call Dialog without Dialog, which requires agents to develop visually grounded dialog models that can adapt to new tasks without language level supervision.


Review for NeurIPS paper: Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Neural Information Processing Systems

Weaknesses: The main problem with the paper is the game design. In visual dialog, i.e., the GuessWhich game [2], Q-Bot does not have access to the image. It has to build up a visual representation based on the caption and the dialog. That is why having a caption is important for the GuessWhich game (L69). In the proposed game, by contrast, Q-Bot has constant access to the images, so it just needs to ask questions that distinguish one image from the others.


Review for NeurIPS paper: Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Neural Information Processing Systems

All reviewers agree that this submission is above the acceptance threshold, and they all agree that decoupling text generation from policy learning during RL is a compelling and interesting idea. I would also like to recommend acceptance, with two notes: 1) the reviewers raised a number of questions which were addressed in the author response, most of which are already covered in the Supplementary material, so I would advise the authors to incorporate these points in the main manuscript; 2) I see your method as a way to also deal with language drift more generally. There are a couple of recent papers looking into language drift. For example, Lee et al. (2019) deal with language drift through image grounding, while Lazaridou et al. (2020) and Lu et al. (2020) also decouple generation and policy learning, the former through reranking of language model samples using the RL reward and the latter through distillation, such that the RL signal never disrupts the core language knowledge. Are any of these methods superior to the others?


Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Neural Information Processing Systems

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to a new task, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog models that can adapt to new tasks without language level supervision. We present qualitative results, automated metrics, and human studies that all show our model can adapt to new tasks and maintain language quality. Baselines either fail to perform well at new tasks or experience language drift, becoming unintelligible to humans.