HistoryAwareMultimodalTransformerfor Vision-and-LanguageNavigation
–Neural Information Processing Systems
HAMT efficientlyencodes allthepastpanoramic observationsviaahierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in a panoramic observation and finally takes into account temporal relation between panoramas in the history.
Neural Information Processing Systems
Feb-8-2026, 01:58:58 GMT
- Technology: