SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Gengze Zhou, Yicong Hong, Zun Wang, Chongyang Zhao, Mohit Bansal, Qi Wu
arXiv.org Artificial Intelligence
The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of the language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework.

Subsequent works leverage generic vision-language representations [18, 59, 61, 96, 97] to pretrain vision-language-action policies [14, 16, 32, 34, 36, 60, 74, 81] (Figure 1b), finetuning parameters for specific tasks while maintaining the same model architecture. In this paper, we argue that the essential difference between these tasks lies in the granularity of instruction, and that the learning problems should be unified under the broader concept of language-guided visual navigation (VLN), where the overarching goal is to create a versatile system that can interpret and execute arbitrary language instructions (Figure 1c).
Dec-7-2024