GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang
arXiv.org Artificial Intelligence
Large language models (LLMs) have been widely deployed as autonomous agents that follow user instructions and make decisions in real-world applications. Prior work has made notable progress in benchmarking the instruction-following capabilities of LLMs in general domains, focusing primarily on their inherent commonsense knowledge. More recently, LLMs have increasingly been deployed as domain-oriented agents, which must follow domain-oriented guidelines that may conflict with that commonsense knowledge. Such guidelines exhibit two key characteristics: they comprise a wide range of domain-oriented rules, and they are subject to frequent updates. The absence of a comprehensive benchmark for evaluating how well LLMs follow domain-oriented guidelines remains a significant obstacle to assessing and improving this capability. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate the guideline-following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.
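The abstract does not specify how rule adherence is scored; as a rough illustration of the first evaluation axis ("adherence to diverse rules"), the following minimal Python sketch checks a model response against a set of domain rules and reports the fraction satisfied. All names and rules here are hypothetical and do not reflect GuideBench's actual data format or metrics.

```python
# Hypothetical sketch of a rule-adherence check, in the spirit of
# GuideBench's "adherence to diverse rules" axis. Not the benchmark's
# actual implementation; rule names and checks are illustrative only.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    rule_id: str
    description: str
    check: Callable[[str], bool]  # True if the response complies


def adherence_rate(response: str, rules: List[Rule]) -> float:
    """Fraction of guideline rules that the model response satisfies."""
    if not rules:
        return 1.0
    passed = sum(1 for rule in rules if rule.check(response))
    return passed / len(rules)


# Toy domain rules that may override a model's commonsense default
# (e.g., an agent's instinct to be maximally accommodating).
rules = [
    Rule("no-refund-promise", "Never promise a refund outright",
         lambda text: "guaranteed refund" not in text.lower()),
    Rule("escalate-to-human", "Offer escalation to a human agent",
         lambda text: "human agent" in text.lower()),
]

response = ("I can't promise a refund, but a human agent "
            "will review your case.")
print(f"Adherence: {adherence_rate(response, rules):.2f}")  # 1.00
```

Substring checks like these are only a stand-in; a real harness would likely use structured rule verifiers or an LLM judge, and the paper's second and third axes (robustness to rule updates, alignment with human preferences) would require re-running such checks against revised rule sets and human-rated outputs.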
Jun-18-2025