HistoryAwareMultimodalTransformerfor Vision-and-LanguageNavigation

Neural Information Processing Systems 

HAMT efficientlyencodes allthepastpanoramic observationsviaahierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in a panoramic observation and finally takes into account temporal relation between panoramas in the history.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found