SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

Neural Information Processing Systems 

We propose an approach that exploits object features in addition to scene features for vision-and-language navigation (VLN). This domain gap is also present during pretraining. Our model improves vision-and-language navigation performance in indoor environments; we report the mean and standard error for each metric, and observe an SPL improvement of 1%, consistent with the results reported in the main draft.
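The abstract describes combining object features with scene features for each view. A minimal sketch of one plausible fusion step is below: per-view scene and object features are concatenated and linearly projected into a shared token embedding. The function name, feature dimensions, and projection are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(scene_feats, object_feats, w):
    """Concatenate per-view scene and object features and project them
    into a common embedding space (hypothetical fusion step; the
    paper's actual transformer is not reproduced here)."""
    combined = np.concatenate([scene_feats, object_feats], axis=-1)
    return combined @ w  # linear projection to the model dimension

scene = rng.normal(size=(36, 2048))     # e.g. 36 panoramic views, CNN scene features
objects = rng.normal(size=(36, 512))    # pooled object-detector features per view
w = rng.normal(size=(2048 + 512, 768))  # assumed projection to 768-d tokens
tokens = fuse_features(scene, objects, w)
print(tokens.shape)  # (36, 768)
```

The resulting per-view tokens could then be consumed by any transformer-style navigation model alongside the instruction tokens.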
