CORE: Full-Path Evaluation of LLM Agents Beyond Final State

Michelakis, Panagiotis, Hadjiyiannis, Yiannis, Stamoulis, Dimitrios

Sep-26-2025–arXiv.org Artificial Intelligence

Evaluating AI agents that solve real-world tasks through function-call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool-use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall's tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Sep-26-2025

arXiv.org PDF

Add feedback

Country:
- Europe (0.28)

Genre:
- Workflow (0.46)
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Machine Learning (1.00)
  - Representation & Reasoning > Agents (0.86)
  - Natural Language > Large Language Model (0.84)