Target-Driven Structured Transformer Planner for Vision-Language Navigation
Zhao, Yusheng; Chen, Jinyu; Gao, Chen; Wang, Wenguan; Yang, Lirong; Ren, Haibing; Xia, Huaxia; Liu, Si
arXiv.org Artificial Intelligence
Vision-language navigation is the task of directing an embodied agent to navigate 3D scenes by following natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, yet this problem has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon, goal-guided, and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target, even when it is located in an unexplored environment. In addition, we design a Structured Transformer Planner that incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that TD-STP substantially improves on the success rate of the previous best methods, by 2% and 5% on the test sets of the R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.
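To make the idea of layout-aware attention concrete, the following is a minimal sketch of one generic way to inject graph structure into self-attention: adding a bias to the attention score of node pairs that are connected in the explored map. It is an illustration under stated assumptions, not the TD-STP implementation; the function `structured_attention`, the scalar `edge_bias`, and the toy three-node map are all hypothetical, and the actual model is in the linked repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def structured_attention(features, adjacency, edge_bias=1.0):
    """Single-head self-attention over map nodes with an additive adjacency bias.

    features:  n node feature vectors, each a list of d floats
    adjacency: n x n matrix of 0/1 entries, 1 where two explored nodes are connected
    edge_bias: hypothetical scalar added to the score of connected pairs (assumption)

    Returns (outputs, weights): the attended features and the n x n attention weights.
    """
    n, d = len(features), len(features[0])
    outputs, weights = [], []
    for i in range(n):
        # Scaled dot-product score, shifted upward for nodes adjacent to node i.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(d)
            + edge_bias * adjacency[i][j]
            for j in range(n)
        ]
        row = softmax(scores)
        weights.append(row)
        # Output for node i is the attention-weighted average of all node features.
        outputs.append(
            [sum(row[j] * features[j][k] for j in range(n)) for k in range(d)]
        )
    return outputs, weights


# Toy map (assumption): three nodes in a chain, 0 - 1 - 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
outputs, weights = structured_attention(features, adjacency, edge_bias=2.0)
```

With `edge_bias=0.0` the function reduces to plain scaled dot-product attention, so the bias term isolates the contribution of the layout. A learned, per-head bias or a hard mask over non-adjacent nodes are alternative design choices in the same spirit.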
Jul-19-2022
- Country:
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
  - North America > United States
    - New York > New York County > New York City (0.04)
  - Europe > Portugal
  - Asia > China
    - Beijing > Beijing (0.05)
    - Zhejiang Province > Hangzhou (0.04)
- Genre:
  - Research Report > New Finding (0.34)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Natural Language (1.00)
    - Representation & Reasoning > Agents (0.88)
    - Machine Learning > Neural Networks
      - Deep Learning (0.47)