AITopics | vln-bert

Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

Zhu, Wang; Singh, Ishika; Huang, Yuan; Jia, Robin; Thomason, Jesse

arXiv.org Artificial Intelligence, Dec-23-2023

Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But does that noise matter? We find that pretraining on nonsensical or irrelevant language instructions can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and is still better than pretraining only on clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.
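The abstract names the Unigram + Object method but does not describe how it works. The sketch below is one possible reading, assuming that words are sampled from the unigram frequencies of an instruction corpus and that object names seen along a trajectory are spliced in at random positions. The function names, the toy corpus, and the object list are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def build_unigram_model(instructions):
    """Count word frequencies across a corpus of instructions."""
    counts = Counter(word for text in instructions for word in text.lower().split())
    words = sorted(counts)  # sorted so sampling is reproducible for a given seed
    weights = [counts[w] for w in words]
    return words, weights


def nonsense_instruction(words, weights, objects, length=12, seed=0):
    """Sample `length` words by unigram frequency, then splice in object names.

    Assumption: this mirrors the general idea of "Unigram + Object";
    it is not the authors' implementation.
    """
    rng = random.Random(seed)
    tokens = rng.choices(words, weights=weights, k=length)
    for obj in objects:
        # Insert each object name at a random position in the token list.
        tokens.insert(rng.randrange(len(tokens) + 1), obj)
    return " ".join(tokens)


# Toy corpus and object labels, made up for illustration.
corpus = [
    "walk down the stairs and stop at the brown sofa",
    "turn left and walk past the kitchen table",
]
words, weights = build_unigram_model(corpus)
instruction = nonsense_instruction(words, weights, objects=["lamp", "door"], length=8)
print(instruction)
```

The output is grammatically meaningless by construction, which is the point of the experiment described in the abstract: the text carries word statistics and object mentions but no usable route description.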

instruction, trajectory, vln-bert, (15 more...)

2311.1728

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > Washington > King County > Seattle (0.04)
North America > Dominican Republic (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Airbert: In-domain Pretraining for Vision-and-Language Navigation

Guhur, Pierre-Louis; Tapaswi, Makarand; Chen, Shizhe; Laptev, Ivan; Schmid, Cordelia

arXiv.org Artificial Intelligence, Aug-20-2021

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
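The abstract does not give the formulation of the shuffling loss. As a rough illustration of the idea of learning temporal order, the sketch below assumes a model that assigns a raw score (logit) to a path-instruction pair, builds a negative example by shuffling the path, and applies a two-way cross-entropy that prefers the correctly ordered path. Both function names and the toy path are hypothetical, and the actual Airbert loss may differ.

```python
import math
import random


def shuffled_negative(path, rng):
    """Return a copy of `path` whose step order differs from the original."""
    if len(set(path)) < 2:
        raise ValueError("need at least two distinct steps to shuffle")
    negative = path[:]
    while negative == path:
        rng.shuffle(negative)  # reshuffle until the order actually changes
    return negative


def shuffling_loss(score_ordered, score_shuffled):
    """Two-way cross-entropy that prefers the ordered path over the shuffled one.

    Assumption: scores are raw logits from some path-instruction model.
    """
    # Numerically stable log-sum-exp over the two logits.
    m = max(score_ordered, score_shuffled)
    log_z = m + math.log(math.exp(score_ordered - m) + math.exp(score_shuffled - m))
    return log_z - score_ordered


rng = random.Random(0)
path = ["bedroom", "hallway", "stairs", "kitchen"]  # made-up viewpoint labels
negative = shuffled_negative(path, rng)
print(negative)
print(shuffling_loss(2.0, -1.0))  # small loss: ordered path already preferred
print(shuffling_loss(-1.0, 2.0))  # large loss: shuffled path wrongly preferred
```

The loss is low when the model already ranks the true order above the shuffled one and grows as that ranking flips, so minimizing it pushes the model to attend to the sequence of steps rather than only to their content.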

airbert, dataset, instruction, (17 more...)

2108.09105

Country:

North America > United States (0.14)
Europe > France > Île-de-France > Paris > Paris (0.04)
Asia > India > Telangana > Hyderabad (0.04)

Genre:

Research Report (0.64)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Majumdar, Arjun; Shrivastava, Ayush; Lee, Stefan; Anderson, Peter; Parikh, Devi; Batra, Dhruv

arXiv.org Artificial Intelligence, May-1-2020

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question - can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions [24]) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN - outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful - with their combination resulting in further positive synergistic effects.
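The abstract describes VLN-BERT as a model that scores the compatibility between an instruction and a sequence of images. One plausible way to use such a score is to rank a set of candidate paths and keep the best one. The sketch below assumes that setup. `select_path` and `toy_score` are hypothetical names, and `toy_score` is a simple word-overlap stand-in for a learned model, not VLN-BERT itself.

```python
def select_path(instruction, candidate_paths, score_fn):
    """Return the candidate path with the highest compatibility score."""
    if not candidate_paths:
        raise ValueError("no candidate paths to rank")
    return max(candidate_paths, key=lambda path: score_fn(instruction, path))


def toy_score(instruction, path):
    """Placeholder scorer: count path labels that appear in the instruction.

    Assumption: a real system would call a learned visiolinguistic model here
    and would score images, not text labels.
    """
    words = set(instruction.lower().split())
    return sum(1 for label in path if label in words)


# Made-up candidates, each a sequence of scene labels standing in for images.
candidates = [
    ["hallway", "kitchen", "table"],
    ["stairs", "living", "sofa"],
]
best = select_path(
    "walk down the stairs and stop at the brown sofa", candidates, toy_score
)
print(best)  # ['stairs', 'living', 'sofa']
```

Keeping the scorer as a parameter separates the ranking step from the model, so the same selection logic works with any function that maps an instruction and a path to a number.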

absolute percentage point, instruction, vln-bert, (13 more...)

2004.14973

Country: North America > United States > Oregon (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.34)
