Reviews: Speaker-Follower Models for Vision-and-Language Navigation

Neural Information Processing Systems 

This paper builds upon the indoor vision-and-language navigation task and sequence-to-sequence model described in (Anderson et al., 2017) by introducing three improvements: 1) An encoder-decoder-like architecture, dubbed the "speaker-follower" model, that not only decodes natural language instructions into a sequence of navigation actions using seq2seq (the follower), but also decodes a sequence of navigation actions and image features into natural language instructions using a symmetric seq2seq (the speaker). 2) The speaker model can then be used to score candidate routes (i.e., candidate sequences of images and actions) by the likelihood of the natural language instruction under the speaker model, which enables a form of planning for the seq2seq-based agent. 3) The image and motion are decomposed into 12 yaw and 3 pitch angles. The authors achieve state-of-the-art performance on the task and provide a good ablation analysis of the impact of each of their three improvements, although I would have liked to see navigation attention maps in the appendix as well.
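To make the route-scoring idea concrete, here is a minimal sketch of reranking candidate routes by the log-likelihood a speaker assigns to the instruction. It is not the authors' implementation: the function names, the representation of a route as a list of action tokens (image features are omitted), and the toy emission table standing in for the trained speaker are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a trained speaker: the probability that each
# action "emits" a given instruction word. A real speaker would be a seq2seq
# model conditioned on image features as well as actions.
TOY_EMISSIONS: Dict[str, Dict[str, float]] = {
    "forward": {"walk": 0.5, "straight": 0.4, "stop": 0.1},
    "left": {"turn": 0.5, "left": 0.4, "stop": 0.1},
    "right": {"turn": 0.5, "right": 0.4, "stop": 0.1},
}


def toy_speaker_logprob(instruction: Sequence[str], route: Sequence[str]) -> float:
    """Toy log P(instruction | route): each word is scored by its average
    emission probability over the actions in the route (assumed model)."""
    total = 0.0
    for word in instruction:
        p = sum(TOY_EMISSIONS[action].get(word, 0.0) for action in route) / len(route)
        total += math.log(max(p, 1e-6))  # floor avoids log(0) for unseen words
    return total


def rank_routes_by_speaker(
    candidates: Sequence[List[str]],
    instruction: Sequence[str],
    speaker_logprob: Callable[[Sequence[str], Sequence[str]], float],
) -> List[Tuple[float, List[str]]]:
    """Return (score, route) pairs sorted best-first by speaker log-likelihood."""
    scored = [(speaker_logprob(instruction, route), route) for route in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    instruction = ["turn", "left", "walk"]
    candidates = [["forward", "forward"], ["right", "forward"], ["left", "forward"]]
    for score, route in rank_routes_by_speaker(candidates, instruction, toy_speaker_logprob):
        print(f"{score:8.3f}  {route}")
```

In this sketch the candidate set is given up front; in the setting the paper describes, the candidates would presumably come from the follower, with the speaker acting as a second opinion on which route best explains the instruction.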