Subbarao Kambhampati
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- North America > United States > Colorado > Larimer County > Fort Collins (0.04)
- Europe > Czechia > Prague (0.04)
- North America > United States > Arizona (0.04)
- North America > United States > Colorado (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
The 2025 Planning Performance of Frontier Large Language Models
Corrêa, Augusto B., Pereira, André G., Seipp, Jendrik
The capacity of Large Language Models (LLMs) for reasoning remains an active area of research, with the capabilities of frontier models continually advancing. We provide an updated evaluation of the end-to-end planning performance of three frontier LLMs as of 2025, where models are prompted to generate a plan from PDDL domain and task descriptions. We evaluate DeepSeek R1, Gemini 2.5 Pro, and GPT-5, with the planner LAMA as a reference, on a subset of domains from the most recent Learning Track of the International Planning Competition. Our results show that on standard PDDL domains, the performance of GPT-5 in terms of solved tasks is competitive with LAMA. When the PDDL domains and tasks are obfuscated to test for pure reasoning, the performance of all LLMs degrades, though less severely than previously reported for other models. These results show substantial improvements over prior generations of LLMs, reducing the performance gap to planners on a challenging benchmark.
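The evaluation loop the abstract describes can be sketched as follows: the model is handed a PDDL domain and task and must emit a plan, which is then replayed by a validator. This is a minimal illustrative sketch with a toy STRIPS-style one-block Blocksworld domain, not the IPC Learning Track domains or the validation tooling the paper actually uses.

```python
def applicable(state, action):
    """An action is applicable when all of its preconditions hold in the state."""
    return action["pre"] <= state

def apply_action(state, action):
    """STRIPS semantics: remove delete effects, then add the add effects."""
    return (state - action["del"]) | action["add"]

def validate_plan(initial, goal, actions, plan):
    """Replay the plan step by step; it is valid iff every action is
    applicable when executed and the goal holds in the final state."""
    state = set(initial)
    for name in plan:
        action = actions[name]
        if not applicable(state, action):
            return False
        state = apply_action(state, action)
    return goal <= state

# Toy domain (hypothetical, for illustration): pick up block a, stack it on b.
ACTIONS = {
    "pickup-a": {"pre": {"clear-a", "handempty"},
                 "add": {"holding-a"},
                 "del": {"clear-a", "handempty"}},
    "stack-a-b": {"pre": {"holding-a", "clear-b"},
                  "add": {"on-a-b", "handempty"},
                  "del": {"holding-a", "clear-b"}},
}

# A candidate plan as an LLM might return it, checked against the model.
valid = validate_plan({"clear-a", "clear-b", "handempty"}, {"on-a-b"},
                      ACTIONS, ["pickup-a", "stack-a-b"])
```

Because validity is checked mechanically like this, an LLM's plan either replays to the goal or it does not; no credit is given for plans that merely look plausible.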
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- South America > Brazil > Rio Grande do Sul (0.04)
- Europe > Sweden > Östergötland County > Linköping (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- North America > United States > Colorado > Larimer County > Fort Collins (0.04)
- Europe > Czechia > Prague (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Colorado (0.04)
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Kambhampati, Subbarao, Stechly, Kaya, Valmeekam, Karthik, Saldyt, Lucas, Bhambri, Siddhant, Palod, Vardhan, Gundawar, Atharva, Samineni, Soumya Rani, Kalwar, Durgesh, Biswas, Upasana
Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thoughts" -- implicitly anthropomorphizing the model, implying these tokens resemble steps a human might take when solving a challenging problem. In this paper, we present evidence that this anthropomorphization is not a harmless metaphor but is in fact quite dangerous -- it confuses the nature of these models and how to use them effectively, and leads to questionable research.
- North America > United States > Arizona (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
#AAAI2025 social media round-up: part two
The 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025), which took place in Philadelphia, drew to a close on Tuesday 4 March. We take a look at what attendees got up to during the second half of the event, which featured invited talks, technical sessions, demos, posters, and the workshops. Outgoing AAAI President Francesca Rossi also announced the release of a report on the future of AI research. "I had the great pleasure and privilege of participating in this Presidential Panel report on 'Future of #AI Research' -- now available from @realaaai at https://t.co/DPghoYgneq" "Honored to receive the #BestPaper Award at #AAAI2025 Good-Data Workshop among 25 accepted papers. Grateful to my incredible collaborators @YIJIA_XIAO_, @DianaYiyangWang, and @jd92wang. SciEvo: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis pic.twitter.com/Md8OI4IDty"
- North America > United States > Pennsylvania (0.05)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.05)
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
Valmeekam, Karthik, Stechly, Kaya, Kambhampati, Subbarao
The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs--making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1's performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.
- North America > United States > Arizona (0.04)
- North America > United States > New York (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Asia > Middle East > Jordan (0.04)
PROC2PDDL: Open-Domain Planning Representations from Texts
Zhang, Tianyi, Zhang, Li, Hou, Zhaoyi, Wang, Ziyu, Gu, Yuling, Clark, Peter, Callison-Burch, Chris, Tandon, Niket
Planning in a text-based environment continues to be a major challenge for AI systems. Recent approaches have used language models to predict a planning domain definition (e.g., PDDL) but have only been evaluated in closed-domain simulated environments. To address this, we present Proc2PDDL, the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations. Using this dataset, we evaluate state-of-the-art models on defining the preconditions and effects of actions. We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%. Our analysis shows both syntactic and semantic errors, indicating LMs' deficiency in both generating domain-specific programs and reasoning about events. We hope this analysis and dataset help future progress towards integrating the best of LMs and formal planning.
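The abstract distinguishes syntactic from semantic errors in model-written PDDL. A sketch of what the target output looks like, and of the kind of shallow syntactic check that can flag malformed generations: the action schema below is a hand-written illustration in the general style of such annotations, not an entry from the Proc2PDDL dataset, and the checker only catches unbalanced parentheses, not semantic mistakes in preconditions or effects.

```python
# Illustrative PDDL action schema: preconditions gate when the action
# may fire, effects describe how it changes the state.
action_schema = """
(:action boil-water
  :parameters (?w - water ?p - pot ?s - stove)
  :precondition (and (in ?w ?p) (on ?p ?s) (lit ?s))
  :effect (and (boiled ?w) (not (cold ?w))))
"""

def balanced_parens(text):
    """A minimal syntactic check for LLM-generated PDDL: every '(' must be
    closed by a later ')' and no ')' may appear without a matching '('.
    Semantic errors (wrong predicates, missing effects) pass this check."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

In practice one would hand the generated schema to a real PDDL parser or planner rather than a paren counter; the point is only that syntactic well-formedness is the easy half of the problem the paper measures.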
- Government (0.46)
- Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Can Large Language Models Reason and Plan?
A version appears in the Annals of The New York Academy of Sciences: https://nyaspubs.onlinelibrary.wiley.com/doi/10.1111/nyas.15125
Their seeming versatility has however led many researchers to wonder whether they can also do well on planning and reasoning tasks typically associated with System 2 competency. Nothing in the training and use of LLMs would seem to suggest remotely that they can do any type of principled reasoning (which, as we know, often involves computationally hard inference/search). What LLMs are good at is a form of universal approximate retrieval. This means that LLMs can't even guarantee memorizing complete answers. Despite this, the "Large Language Models are Zero-Shot insert-your-reasoning-task" has almost become a meme. So, are these n-gram models on steroids really capable of planning and reasoning?
- North America > United States > New York (0.24)
- North America > United States > Arizona (0.04)
- Europe > Czechia > Prague (0.04)