History Aware Multimodal Transformer for Vision-and-Language Navigation

Feb-8-2026, 01:58:58 GMT–Neural Information Processing Systems

HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in a panoramic observation and finally takes into account temporal relation between panoramas in the history.
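The three-level hierarchy described above (per-image encoding, then spatial relations within a panorama, then temporal relations across panoramas) can be sketched roughly as follows. This is only an illustrative sketch, not the HAMT implementation: the per-image ViT is replaced by a stand-in linear projection `proj`, the attention is single-head with no learned weights, mean pooling over views is an assumed way of summarizing a panorama, and the shapes (4 panoramas, 36 views, 48 values per view, 16 feature dimensions) as well as the names `self_attention` and `encode_history` are hypothetical.

```python
import numpy as np


def self_attention(x):
    """Single-head scaled dot-product self-attention with no learned weights.

    x: (n_tokens, dim) array; returns an array of the same shape.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def encode_history(history, proj):
    """Hierarchically encode a history of panoramas.

    history: (T, V, P) array of T panoramas, V views each, P values per view.
    proj:    (P, D) stand-in for the per-image encoder (assumption, not a ViT).
    Returns one D-dimensional embedding per panorama, shape (T, D).
    """
    T, V, P = history.shape
    # Level 1: encode every image independently.
    image_feats = (history.reshape(T * V, P) @ proj).reshape(T, V, -1)
    # Level 2: relate the views inside each panorama, then pool them
    # into a single panorama embedding (mean pooling is an assumption).
    pano_feats = np.stack(
        [self_attention(views).mean(axis=0) for views in image_feats]
    )
    # Level 3: relate the panorama embeddings across time steps.
    return self_attention(pano_feats)


rng = np.random.default_rng(0)
history = rng.normal(size=(4, 36, 48))         # hypothetical toy history
proj = rng.normal(size=(48, 16)) / np.sqrt(48)
out = encode_history(history, proj)
print(out.shape)
```

Because attention is applied to V views at a time and then to T panorama embeddings, rather than to all T × V images at once, the attention matrices stay small as the history grows, which is one plausible reading of the efficiency the summary mentions.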

artificial intelligence, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.94)
  - Vision (0.66)
  - Machine Learning > Neural Networks
    - Deep Learning (0.69)
