Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision
Supplemental Material

Keji He 1,2, Yan Huang 1,2, Qi Wu 3, Jianhua Yang 5

Neural Information Processing Systems 

Our baseline is an encoder-decoder architecture similar to the RCM [1] as implemented in [2]. In [2], the words of each instruction are encoded in reverse order, whereas in our version they are encoded sequentially. The encoder consists of an embedding layer and an LSTM layer. The decoder consists of a vision attention module, a text attention module, and an action prediction module.

Screenshots of the annotation and verification processes in our web-based collection tool are shown in Figure 1 and Figure 2.
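To make the baseline's data flow concrete, the following is a minimal sketch of one decoder step, not the actual implementation. It makes several assumptions that the text does not state: attention is plain dot-product soft attention, the two attended contexts are summed with the hidden state, and candidate actions are scored by a dot product. The function names (`order_tokens`, `attend`, `decoder_step`) are hypothetical. The embedding layer and LSTM of the encoder are omitted; `word_states` stands in for their per-word outputs.

```python
import math
from typing import List, Sequence

Vector = List[float]


def order_tokens(tokens: Sequence[str], reverse: bool = False) -> List[str]:
    """Return tokens in the order they are fed to the encoder.

    reverse=True mimics the reversed word order attributed to [2];
    reverse=False is the sequential order used by the baseline.
    """
    return list(reversed(tokens)) if reverse else list(tokens)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, keys: Sequence[Vector]) -> Vector:
    """Dot-product soft attention (an assumed form, not from the paper)."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]


def decoder_step(
    hidden: Vector,
    word_states: Sequence[Vector],
    view_feats: Sequence[Vector],
    candidate_feats: Sequence[Vector],
) -> List[float]:
    """One step: vision attention, text attention, then action prediction."""
    vis_ctx = attend(hidden, view_feats)    # vision attention module
    text_ctx = attend(hidden, word_states)  # text attention module
    # Assumption: fuse by summation; the real fusion may differ.
    query = [h + t + v for h, t, v in zip(hidden, text_ctx, vis_ctx)]
    # Action prediction module: distribution over candidate actions.
    return softmax([dot(query, c) for c in candidate_feats])


# Toy usage with 2-dimensional features.
tokens = order_tokens(["turn", "left", "at", "the", "sofa"])
probs = decoder_step(
    hidden=[1.0, 0.0],
    word_states=[[1.0, 0.0], [0.0, 1.0]],
    view_feats=[[0.0, 1.0], [1.0, 0.0]],
    candidate_feats=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
)
best_action = probs.index(max(probs))
```

Calling `order_tokens(tokens, reverse=True)` instead would reproduce the reversed input order described for [2]; the rest of the step is unchanged, since only the order in which words reach the encoder differs.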