SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
Neural Information Processing Systems
We propose an approach that exploits object features in addition to scene features for vision-and-language navigation (VLN). This domain gap is also present during pretraining. We propose a new model with better VLN performance in indoor environments. We report the mean and standard error for each metric. ... SPL by 1%, which is consistent with the results reported in the main draft.