Subbarao Kambhampati
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- North America > United States > Colorado > Larimer County > Fort Collins (0.04)
- Europe > Czechia > Prague (0.04)
- North America > United States > Arizona (0.04)
- North America > United States > Colorado (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
The 2025 Planning Performance of Frontier Large Language Models
Corrêa, Augusto B., Pereira, André G., Seipp, Jendrik
The capacity of Large Language Models (LLMs) for reasoning remains an active area of research, with the capabilities of frontier models continually advancing. We provide an updated evaluation of the end-to-end planning performance of three frontier LLMs as of 2025, where models are prompted to generate a plan from PDDL domain and task descriptions. We evaluate DeepSeek R1, Gemini 2.5 Pro, and GPT-5, with the planner LAMA as a reference, on a subset of domains from the most recent Learning Track of the International Planning Competition. Our results show that on standard PDDL domains, the performance of GPT-5 in terms of solved tasks is competitive with LAMA. When the PDDL domains and tasks are obfuscated to test for pure reasoning, the performance of all LLMs degrades, though less severely than previously reported for other models. These results show substantial improvements over prior generations of LLMs, reducing the performance gap to planners on a challenging benchmark.
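The evaluation loop the abstract describes can be sketched as follows: the model is handed a PDDL domain and task and must emit a plan, which is then replayed by a validator. This is a minimal illustrative sketch with a toy STRIPS-style one-block Blocksworld domain, not the IPC Learning Track domains or the validation tooling the paper actually uses.

```python
def applicable(state, action):
    """An action is applicable when all of its preconditions hold in the state."""
    return action["pre"] <= state

def apply_action(state, action):
    """STRIPS semantics: remove delete effects, then add the add effects."""
    return (state - action["del"]) | action["add"]

def validate_plan(initial, goal, actions, plan):
    """Replay the plan step by step; it is valid iff every action is
    applicable when executed and the goal holds in the final state."""
    state = set(initial)
    for name in plan:
        action = actions[name]
        if not applicable(state, action):
            return False
        state = apply_action(state, action)
    return goal <= state

# Toy domain (hypothetical, for illustration): pick up block a, stack it on b.
ACTIONS = {
    "pickup-a": {"pre": {"clear-a", "handempty"},
                 "add": {"holding-a"},
                 "del": {"clear-a", "handempty"}},
    "stack-a-b": {"pre": {"holding-a", "clear-b"},
                  "add": {"on-a-b", "handempty"},
                  "del": {"holding-a", "clear-b"}},
}

# A candidate plan as an LLM might return it, checked against the model.
valid = validate_plan({"clear-a", "clear-b", "handempty"}, {"on-a-b"},
                      ACTIONS, ["pickup-a", "stack-a-b"])
```

Because validity is checked mechanically like this, an LLM's plan either replays to the goal or it does not; no credit is given for plans that merely look plausible.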
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- South America > Brazil > Rio Grande do Sul (0.04)
- Europe > Sweden > Östergötland County > Linköping (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- North America > United States > Colorado > Larimer County > Fort Collins (0.04)
- Europe > Czechia > Prague (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Colorado (0.04)
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Kambhampati, Subbarao, Stechly, Kaya, Valmeekam, Karthik, Saldyt, Lucas, Bhambri, Siddhant, Palod, Vardhan, Gundawar, Atharva, Samineni, Soumya Rani, Kalwar, Durgesh, Biswas, Upasana
Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thoughts" -- implicitly anthropomorphizing the model, implying these tokens resemble steps a human might take when solving a challenging problem. In this paper, we present evidence that this anthropomorphization is not a harmless metaphor but is in fact quite dangerous -- it confuses the nature of these models and how to use them effectively, and leads to questionable research.
- North America > United States > Arizona (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
#AAAI2025 social media round-up: part two
The 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025), which took place in Philadelphia, drew to a close on Tuesday 4 March. We take a look at what attendees got up to during the second half of the event, which featured invited talks, technical sessions, demos, posters, and the workshops. Outgoing AAAI President Francesca Rossi also announced the release of a report on the future of AI research. "I had the great pleasure and privilege of participating in this Presidential Panel report on 'Future of #AI Research' -- now available from @realaaai at https://t.co/DPghoYgneq" "Honored to receive the #BestPaper Award at #AAAI2025 Good-Data Workshop among 25 accepted papers. Grateful to my incredible collaborators @YIJIA_XIAO_, @DianaYiyangWang, and @jd92wang. SciEvo: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis pic.twitter.com/Md8OI4IDty"
- North America > United States > Pennsylvania (0.05)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.05)
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
Valmeekam, Karthik, Stechly, Kaya, Kambhampati, Subbarao
The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs--making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1's performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.
- North America > United States > Arizona (0.04)
- North America > United States > New York (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Asia > Middle East > Jordan (0.04)
PROC2PDDL: Open-Domain Planning Representations from Texts
Zhang, Tianyi, Zhang, Li, Hou, Zhaoyi, Wang, Ziyu, Gu, Yuling, Clark, Peter, Callison-Burch, Chris, Tandon, Niket
Planning in a text-based environment continues to be a major challenge for AI systems. Recent approaches have used language models to predict a planning domain definition (e.g., PDDL) but have only been evaluated in closed-domain simulated environments. To address this, we present Proc2PDDL, the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations. Using this dataset, we evaluate state-of-the-art models on defining the preconditions and effects of actions. We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%. Our analysis shows both syntactic and semantic errors, indicating LMs' deficiency in both generating domain-specific programs and reasoning about events. We hope this analysis and dataset help future progress towards integrating the best of LMs and formal planning.
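The abstract distinguishes syntactic from semantic errors in model-written PDDL. A sketch of what the target output looks like, and of the kind of shallow syntactic check that can flag malformed generations: the action schema below is a hand-written illustration in the general style of such annotations, not an entry from the Proc2PDDL dataset, and the checker only catches unbalanced parentheses, not semantic mistakes in preconditions or effects.

```python
# Illustrative PDDL action schema: preconditions gate when the action
# may fire, effects describe how it changes the state.
action_schema = """
(:action boil-water
  :parameters (?w - water ?p - pot ?s - stove)
  :precondition (and (in ?w ?p) (on ?p ?s) (lit ?s))
  :effect (and (boiled ?w) (not (cold ?w))))
"""

def balanced_parens(text):
    """A minimal syntactic check for LLM-generated PDDL: every '(' must be
    closed by a later ')' and no ')' may appear without a matching '('.
    Semantic errors (wrong predicates, missing effects) pass this check."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

In practice one would hand the generated schema to a real PDDL parser or planner rather than a paren counter; the point is only that syntactic well-formedness is the easy half of the problem the paper measures.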
- Government (0.46)
- Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Can Large Language Models Reason and Plan?
A version appears in the Annals of The New York Academy of Sciences: https://nyaspubs.onlinelibrary.wiley.com/doi/10.1111/nyas.15125
Their seeming versatility has however led many researchers to wonder whether they can also do well on planning and reasoning tasks typically associated with System 2 competency. Nothing in the training and use of LLMs would seem to suggest remotely that they can do any type of principled reasoning (which, as we know, often involves computationally hard inference/search). What LLMs are good at is a form of universal approximate retrieval. This means that LLMs can't even guarantee memorizing complete answers. Despite this, the "Large Language Models are Zero-Shot insert-your-reasoning-task" has almost become a meme. So, are these n-gram models on steroids really capable of planning and reasoning?
- North America > United States > New York (0.24)
- North America > United States > Arizona (0.04)
- Europe > Czechia > Prague (0.04)