NaVILA: Legged Robot Vision-Language-Action Model for Navigation

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, Xiaolong Wang

arXiv.org Artificial Intelligence 

Example instructions shown in Figure 1: Stop when you are very close to the trash can. Walk to the other end of the room, turn left and find a toy kitchen set. Move forward out of the room. Proceed to the grass and stop in front of the soccers. Walk forward, when seeing the stair bars, turn right and walk around the stairs until reaching the hallway. Turn right and walk along the hallway, stop in front of a bathroom. Walk forward along the way. Turn a little left and keep going straight. Move forward along the way. Turn left at the yellow fire hydrant. Go forward along the slope and stop in front of the door.

Figure 1: Real-world demonstration of NaVILA. Upon receiving human instructions, NaVILA uses a vision-language model to process RGB video frames and employs locomotion skills to execute the task on a robot. The robot successfully handles long-horizon navigation tasks and operates safely in challenging environments.

This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only gives humans a flexible way to issue commands but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way to low-level leg joint actions.