Appendix for "Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation"

Neural Information Processing Systems

In our experiments, the fine-grained map, global semantic map, and multi-granularity map are of different sizes (as shown in Figure A) to save GPU memory. Object categories are predicted by the hallucination module. We use an Adam optimizer with a learning rate of 2.5e-4. Specifically, we take the 10% of the area with the highest probability in the 2D distributions P and P̂ (as described in Section 3.3) as the ground-truth and predicted locations. From Table 1, this variant performs worse than our agent.
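The "top 10% of a 2D distribution" step can be sketched with a small NumPy helper. This is a hedged illustration, not the paper's code: the function name `top_fraction_mask` and the tie-breaking strategy are assumptions; the paper only specifies that the highest-probability 10% of cells is treated as the location region.

```python
import numpy as np

def top_fraction_mask(prob: np.ndarray, fraction: float = 0.10) -> np.ndarray:
    """Boolean mask over a 2D probability map marking the `fraction`
    of cells with the highest probability (exactly k cells)."""
    k = max(1, int(round(fraction * prob.size)))
    # k-th largest value via partial sort of the flattened map
    thresh = np.partition(prob.ravel(), -k)[-k]
    mask = prob >= thresh
    if mask.sum() > k:  # resolve ties deterministically: keep exactly k cells
        top_idx = np.argsort(prob.ravel())[::-1][:k]
        mask = np.zeros(prob.size, dtype=bool)
        mask[top_idx] = True
        mask = mask.reshape(prob.shape)
    return mask

# Example: on a 10x10 map, the mask selects exactly 10 cells,
# always including the argmax cell.
rng = np.random.default_rng(0)
p = rng.random((10, 10))
p /= p.sum()
m = top_fraction_mask(p)
```

A mask like this lets the ground-truth region from P and the predicted region from P̂ be compared cell-wise (e.g. by overlap), which is one natural way to read the evaluation described in Section 3.3.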


History Aware Multimodal Transformer for Vision-and-Language Navigation

Neural Information Processing Systems

HAMT efficiently encodes all past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with a ViT, then models spatial relations between images within a panoramic observation, and finally accounts for temporal relations between panoramas in the history.
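The three-stage hierarchy can be sketched in miniature. This is only an illustrative NumPy toy, not HAMT's implementation: the function names, the mean-pooling stand-in for the per-image ViT, and the parameter-free attention are all assumptions made to keep the sketch self-contained; the real model uses learned transformer layers at each stage.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal scaled dot-product self-attention over rows of x,
    with no learned projections (toy stand-in for a transformer layer)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def encode_history(history: np.ndarray) -> np.ndarray:
    """history: (T panoramas, V views per panorama, P patches, d) features.
    Stage 1: encode each view image from its patches (ViT stand-in).
    Stage 2: relate views spatially within each panorama.
    Stage 3: relate panorama embeddings temporally across the history."""
    # Stage 1: per-image encoding (mean over patches as a ViT stand-in)
    views = history.mean(axis=2)                                      # (T, V, d)
    # Stage 2: spatial attention among the V views of each panorama,
    # pooled to one embedding per panorama
    panos = np.stack([self_attention(v).mean(axis=0) for v in views]) # (T, d)
    # Stage 3: temporal attention across the T panorama embeddings
    return self_attention(panos)                                      # (T, d)
```

The point of the hierarchy, as the abstract describes it, is efficiency: attention is applied within small groups (patches in an image, views in a panorama) before the history-length sequence, rather than over all patches of all panoramas at once.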