Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation

Li, Yu, Li, Dayou, Zhao, Chenkun, Wang, Ruifeng, Song, Ran, Zhang, Wei

Aug-20-2024–arXiv.org Artificial Intelligence

To complete a complex task in which a robot navigates to a goal object and fetches it, the robot needs a good understanding of both the instructions and the surrounding environment. Large pre-trained models have shown that they can interpret tasks defined via language descriptions. However, previous methods that attempt to integrate large pre-trained models with daily tasks fall short in many robotic goal navigation tasks because they understand the environment poorly. In this work, we present a visual scene representation built with large-scale visual language models, which forms a feature representation of the environment capable of handling natural language queries. Combined with large language models, this method can parse language instructions into action sequences for a robot to follow and accomplish goal navigation by querying the scene representation. Experiments demonstrate that our method enables the robot to follow a wide range of instructions and complete complex goal navigation tasks.
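The idea of a scene representation that answers natural language queries can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how features are stored or matched, so the sketch assumes that each map location holds a feature vector, that a text query is embedded into the same space, and that matching is done by cosine similarity. The names `cosine` and `query_scene`, and the hand-written three-dimensional vectors standing in for visual language model features, are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query_scene(scene, text_feature):
    """Return the map cell whose stored feature best matches the query."""
    return max(scene, key=lambda cell: cosine(scene[cell], text_feature))

# Hypothetical scene representation: grid cell -> feature vector.
# A real system would fill these with visual language model features.
scene = {
    (0, 0): [0.9, 0.1, 0.0],
    (2, 1): [0.1, 0.9, 0.1],
    (4, 3): [0.0, 0.2, 0.9],
}

# Hypothetical text embeddings for two goal objects (assumed values).
text_features = {
    "mug": [1.0, 0.0, 0.0],
    "sofa": [0.0, 0.1, 1.0],
}

# The cell returned here would serve as the navigation goal.
goal = query_scene(scene, text_features["sofa"])
print(goal)
```

In this framing, a language model would only need to turn an instruction into a sequence of such queries and movement actions, while the scene representation grounds each queried object in a location.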

extracting visual scene representation, pre-trained model, robotic goal navigation

Genre:
- Research Report (0.69)

Technology:
- Information Technology > Artificial Intelligence > Robots (1.00)