DKPROMPT: Domain Knowledge Prompting Vision-Language Models for Open-World Planning

Zhang, Xiaohan, Altaweel, Zainab, Hayamizu, Yohei, Ding, Yan, Amiri, Saeid, Yang, Hao, Kaminski, Andy, Esselink, Chad, Zhang, Shiqi

arXiv.org Artificial Intelligence 

Prompting foundation models such as large language models (LLMs) and vision-language models (VLMs) requires extensive domain knowledge and manual effort, resulting in the so-called "prompt engineering" problem. To improve the performance of foundation models, one can provide examples explicitly [1] or implicitly [2], or encourage intermediate reasoning steps [3, 4]. Despite these efforts, their performance on long-horizon reasoning tasks remains limited. Classical planning methods, including those defined in the Planning Domain Definition Language (PDDL), are strong in ensuring soundness, completeness, and efficiency in planning tasks [5]. However, classical planners rely on predefined states and actions and do not perform well in open-world scenarios. We aim to enjoy the openness of VLMs in scene understanding while retaining the strong long-horizon reasoning capabilities of classical planners. Our key idea is to extract domain knowledge from classical planners and use it to prompt VLMs, yielding classical planners that are visually grounded and responsive to open-world situations. Given the natural connection between planning symbols and human language, this paper investigates how pre-trained VLMs can assist the robot in realizing symbolic plans generated by classical planners, while avoiding the engineering effort of manually checking the outcome of each action.
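To make the key idea concrete, below is a minimal sketch (not the authors' implementation) of how domain knowledge from a PDDL-style action could be used to prompt a VLM: the action's grounded preconditions and effects are rendered as yes/no questions, the VLM answers them from the current camera image, and the answers determine whether the action may start and whether it succeeded. The `query_vlm`, `get_image`, and `run_action` callables are hypothetical stand-ins for whatever VLM API and robot interface are in use.

```python
# A hedged sketch of VLM-based precondition/effect checking for a symbolic plan.
# All model/robot interfaces are assumptions passed in as callables.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SymbolicAction:
    """A PDDL-style action with grounded preconditions and effects."""
    name: str
    preconditions: List[str] = field(default_factory=list)  # e.g. "the fridge door is open"
    effects: List[str] = field(default_factory=list)        # e.g. "the robot is holding the milk"


def to_questions(facts: List[str]) -> List[str]:
    """Render grounded symbolic facts as yes/no questions for the VLM."""
    return [f"Based on the image, is it true that {fact}? Answer yes or no." for fact in facts]


def check_facts(
    facts: List[str],
    image: bytes,
    query_vlm: Callable[[bytes, str], str],
) -> Tuple[bool, Dict[str, bool]]:
    """Ask the VLM about each fact; return the overall verdict and per-fact answers."""
    answers: Dict[str, bool] = {}
    for fact, question in zip(facts, to_questions(facts)):
        reply = query_vlm(image, question).strip().lower()
        answers[fact] = reply.startswith("yes")
    return all(answers.values()), answers


def execute_with_vlm_checks(
    plan: List[SymbolicAction],
    get_image: Callable[[], bytes],
    run_action: Callable[[SymbolicAction], None],
    query_vlm: Callable[[bytes, str], str],
) -> bool:
    """Execute a symbolic plan, using the VLM to verify each action's
    preconditions before execution and its effects afterwards.
    Returns False (signalling a need to replan) on the first failed check."""
    for action in plan:
        ok, _ = check_facts(action.preconditions, get_image(), query_vlm)
        if not ok:
            return False  # an open-world situation violated a precondition
        run_action(action)
        ok, _ = check_facts(action.effects, get_image(), query_vlm)
        if not ok:
            return False  # the action did not achieve its intended effects
    return True
```

In this sketch, a failed check simply triggers replanning; the symbolic planner itself is unchanged, which is how the visual grounding can be added without re-engineering per-action outcome checks.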
