Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
Karthik Valmeekam, Matthew Marquez, Subbarao Kambhampati
arXiv.org Artificial Intelligence
There have been widespread claims about Large Language Models (LLMs) being able to successfully verify or self-critique their candidate solutions in reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of large language models in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM's performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers. The LLM verifier in this system produces a notable number of false positives, compromising the system's reliability. Additionally, the nature of feedback, whether binary or detailed, showed minimal impact on plan generation.
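The system the abstract describes is a generate-test loop in which one LLM instance proposes a plan and another instance of the same LLM verifies it, back-prompting the generator with either binary or detailed feedback. Below is a minimal sketch of that loop, not the authors' implementation: the `llm` callable, the prompt wording, and the `binary_feedback` flag are all illustrative assumptions.

```python
from typing import Callable

def self_critique_loop(
    problem: str,
    llm: Callable[[str], str],  # hypothetical LLM interface: prompt -> completion
    max_rounds: int = 10,
    binary_feedback: bool = True,
) -> str:
    """Iterative generate-verify loop where the same LLM both proposes
    a plan and critiques it (the setup the paper evaluates)."""
    plan = llm(f"Produce a plan for the following problem:\n{problem}")
    for _ in range(max_rounds):
        # Ask the LLM to act as the verifier for its own candidate plan.
        critique = llm(
            "Verify the plan below for the given problem. Answer VALID or INVALID"
            + ("" if binary_feedback else ", and if INVALID list the failing steps")
            + f".\nProblem:\n{problem}\nPlan:\n{plan}"
        )
        if critique.strip().upper().startswith("VALID"):
            return plan  # the LLM verifier accepted the plan (possibly a false positive)
        # Back-prompt: feed the critique to the generator and request a revision.
        plan = llm(
            f"Your previous plan was judged invalid:\n{critique}\n"
            f"Revise the plan for:\n{problem}"
        )
    return plan
```

Replacing the LLM verifier with an external, sound verifier (e.g., a classical plan validator) changes only the `critique` step; the paper's finding is that this substitution, not richer feedback, is what improves performance.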
Oct-12-2023