error message
How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations
We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.
Self-Correction Distillation for Structured Data Question Answering
Zhu, Yushan, Zhang, Wen, Jin, Long, Sun, Mengshu, Zhong, Ling, Liu, Zhiqiang, Li, Juan, Liang, Lei, Long, Chong, Deng, Chao, Feng, Junlan
Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs since small-scale LLMs are prone to errors in generating structured queries. To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs' query-generation and error-correction capabilities to small-scale LLM. Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.
- North America > United States > Utah (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Workflow (0.68)
- Research Report (0.64)
An LLM-powered Natural-to-Robotic Language Translation Framework with Correctness Guarantees
Chen, ZhenDong, Nie, ZhanShang, Wan, ShiXing, Li, JunYi, Cheng, YongTian, Zhao, Shuai
--The Large Language Models (LLM) are increasingly being deployed in robotics to generate robot control programs for specific user tasks, enabling embodied intelligence. Existing methods primarily focus on LLM training and prompt design that utilize LLMs to generate executable programs directly from user tasks in natural language. However, due to the inconsistency of the LLMs and the high complexity of the tasks, such best-effort approaches often lead to tremendous programming errors in the generated code, which significantly undermines the effectiveness especially when the light-weight LLMs are applied. This paper introduces a natural-robotic language translation framework that (i) provides correctness verification for generated control programs and (ii) enhances the performance of LLMs in program generation via feedback-based fine-tuning for the programs. T o achieve this, a Robot Skill Language (RSL) is proposed to abstract away from the intricate details of the control programs, bridging the natural language tasks with the underlying robot skills. Then, the RSL compiler and debugger are constructed to verify RSL programs generated by the LLM and provide error feedback to the LLM for refining the outputs until being verified by the compiler . This provides correctness guarantees for the LLM-generated programs before being offloaded to the robots for execution, significantly enhancing the effectiveness of LLMpowered robotic applications. Experiments demonstrate NRTrans outperforms the existing method under a range of LLMs and tasks, and achieves a high success rate for light-weight LLMs. To bridge this gap, LLMs ( e.g., OpenAI's GPT [4], Meta's Llama [5], and Google's Gemma [6]) are deployed in robotics, leveraging their semantic comprehension and contextual reasoning to generate robotic control programs that fulfill the given tasks. Most of the existing LLM-powered control program generation methods for robotics can be broadly categorized into the following three fundamental paradigms [7]-[9], as shown in Figure 1.
AgenticData: An Agentic Data Analytics System for Heterogeneous Data
Sun, Ji, Li, Guoliang, Zhou, Peiyao, Ma, Yihui, Xu, Jingzhe, Li, Yuan
Existing unstructured data analytics systems rely on experts to write code and manage complex analysis workflows, making them both expensive and time-consuming. To address these challenges, we introduce AgenticData, an innovative agentic data analytics system that allows users to simply pose natural language (NL) questions while autonomously analyzing data sources across multiple domains, including both unstructured and structured data. First, AgenticData employs a feedback-driven planning technique that automatically converts an NL query into a semantic plan composed of relational and semantic operators. We propose a multi-agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross-validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short-term context and long-term knowledge. Second, we propose a semantic optimization model to refine and execute semantic plans effectively. Our system, AgenticData, has been tested using three benchmarks. Experimental results showed that AgenticData achieved superior accuracy on both easy and difficult tasks, significantly outperforming state-of-the-art methods.
Towards LLM-based Root Cause Analysis of Hardware Design Failures
Qiu, Siyu, Wang, Muzhi, Afsharmazayejani, Raheel, Shahmiri, Mohammad Moradi, Tan, Benjamin, Pearce, Hammond
--With advances in large language models (LLMs), new opportunities have emerged to develop tools that support the digital hardware design process. In this work, we explore how LLMs can assist with explaining the root cause of design issues and bugs that are revealed during synthesis and simulation, a necessary milestone on the pathway towards widespread use of LLMs in the hardware design process and for hardware security analysis. We find promising results: for our corpus of 34 different buggy scenarios, OpenAI's o3-mini reasoning model reached a correct determination 100% of the time under pass@5 scoring, with other state of the art models and configurations usually achieving more than 80% performance and more than 90% when assisted with retrieval-augmented generation. Encountering bugs, glitches, and faults is a normal part of the digital hardware design lifecycle. To ensure they are completely removed and repaired is a time-consuming process requiring a deep understanding of both the technical cause of the issue as well as any impacts on the broader hardware system - particularly as any missed repair may have severe downstream functional and/or security consequences [1] (if the bug is of an exploitable nature). However, as digital hardware grows in complexity, so do the frequency and nature of the bugs themselves.
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)
- Asia > India (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- (4 more...)
Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools
Solano, Lorenzo Lee, Koutcheme, Charles, Leinonen, Juho, Vassar, Alexandra, Renzella, Jake
Frontier Large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real introductory programming (CS1/2) student-generated programming errors, which we used to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (6 more...)
- Information Technology > Security & Privacy (1.00)
- Education (1.00)
PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
Zhu, Wang Bill, Chai, Miaosen, Singh, Ishika, Jia, Robin, Thomason, Jesse
We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and Overcooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot.
- North America > United States > California (0.14)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)
Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving
Cao, Chuxue, Li, Mengze, Dai, Juntao, Yang, Jinluan, Zhao, Zijian, Zhang, Shengyu, Shi, Weijie, Liu, Chengzhong, Han, Sirui, Guo, Yike
Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
- North America > United States (0.04)
- North America > Dominican Republic (0.04)
- Europe > Germany > Berlin (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)
Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM
Xiang, Yong, Chen, Charley Peter, Zeng, Liyi, Yin, Wei, Liu, Xin, Li, Hu, Xu, Wei
Kubernetes, a notably complex and distributed system, utilizes an array of controllers to uphold cluster management logic through state reconciliation. Nevertheless, maintaining state consistency presents significant challenges due to unexpected failures, network disruptions, and asynchronous issues, especially within dynamic cloud environments. These challenges result in operational disruptions and economic losses, underscoring the necessity for robust root cause analysis (RCA) to enhance Kubernetes reliability. The development of large language models (LLMs) presents a promising direction for RCA. However, existing methodologies encounter several obstacles, including the diverse and evolving nature of Kubernetes incidents, the intricate context of incidents, and the polymorphic nature of these incidents. In this paper, we introduce SynergyRCA, an innovative tool that leverages LLMs with retrieval augmentation from graph databases and enhancement with expert prompts. SynergyRCA constructs a StateGraph to capture spatial and temporal relationships and utilizes a MetaGraph to outline entity connections. Upon the occurrence of an incident, an LLM predicts the most pertinent resource, and SynergyRCA queries the MetaGraph and StateGraph to deliver context-specific insights for RCA. We evaluate SynergyRCA using datasets from two production Kubernetes clusters, highlighting its capacity to identify numerous root causes, including novel ones, with high efficiency and precision. SynergyRCA demonstrates the ability to identify root causes in an average time of about two minutes and achieves an impressive precision of approximately 0.90.
- Asia > China > Beijing > Beijing (0.04)
- North America > Canada (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
Yang, Zhiyu, Wang, Shuo, Yan, Yukun, Deng, Yang
LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs' capabilities to autonomously find and fix runtime logical errors in complex data science code remain largely unexplored. To address this gap, we introduce DSDBench: the Data Science Debugging Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop error tracing and multi-bug detection in data science code debugging. DSDBench adapts datasets from existing data science task benchmarks, such as DABench and MatPlotBench, featuring realistic data science debugging tasks with automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes 1,117 annotated samples with 741 cause-effect error pairs and runtime error messages. Evaluations of state-of-the-art LLMs on DSDBench show significant performance gaps, highlighting challenges in debugging logical runtime errors in data science code. DSDBench offers a crucial resource to evaluate and improve LLMs' debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the future.DSDBench is publicly available at https://github.com/KevinCL16/DSDBench.