Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation
Li, Yu; Li, Dayou; Zhao, Chenkun; Wang, Ruifeng; Song, Ran; Zhang, Wei
arXiv.org Artificial Intelligence
To complete a complex task in which a robot navigates to a goal object and fetches it, the robot needs a good understanding of both the instructions and the surrounding environment. Large pre-trained models have shown the ability to interpret tasks defined via language descriptions. However, previous methods that integrate large pre-trained models with daily tasks perform poorly on many robotic goal navigation tasks because of their poor understanding of the environment. In this work, we present a visual scene representation, built with large-scale visual language models, that forms a feature representation of the environment capable of handling natural language queries. Combined with large language models, this method can parse language instructions into action sequences for a robot to follow and accomplish goal navigation by querying the scene representation. Experiments demonstrate that our method enables the robot to follow a wide range of instructions and complete complex goal navigation tasks.
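As a rough illustration of what "querying the scene representation" with natural language could look like, the toy sketch below stores one feature vector per map cell and returns the cell most similar to a query embedding. Everything in it is an assumption made for illustration and is not the authors' implementation: the names `SCENE_MAP`, `TEXT_EMBEDDINGS`, `locate`, and `plan` are hypothetical, the 3-dimensional vectors are hand-written stand-ins for features a visual language model would produce, and the fixed two-step plan stands in for the action sequence a large language model would generate.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical scene map: grid cell -> feature vector.
# In a real system these would come from a visual language model.
SCENE_MAP = {
    (0, 0): [0.9, 0.1, 0.0],  # region resembling a mug
    (2, 1): [0.1, 0.9, 0.1],  # region resembling a sofa
    (4, 3): [0.0, 0.2, 0.9],  # region resembling a plant
}

# Hypothetical text embeddings living in the same feature space.
TEXT_EMBEDDINGS = {
    "mug": [1.0, 0.0, 0.0],
    "sofa": [0.0, 1.0, 0.0],
    "plant": [0.0, 0.0, 1.0],
}


def locate(query):
    """Return the map cell whose features best match the text query."""
    q = TEXT_EMBEDDINGS[query]
    return max(SCENE_MAP, key=lambda cell: cosine(SCENE_MAP[cell], q))


def plan(goal):
    """Stand-in for an LLM-produced action sequence: go to the goal, then fetch it."""
    return [("navigate_to", locate(goal)), ("pick_up", goal)]


if __name__ == "__main__":
    print(plan("mug"))
```

The design point this sketch tries to convey is only that language and scene features share one embedding space, so a goal named in an instruction can be resolved to a location by similarity search before any action is executed.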
Aug-20-2024
- Genre: Research Report (0.69)
- Technology: Information Technology > Artificial Intelligence > Robots (1.00)