Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Geng, Yilin, Li, Haonan, Mu, Honglin, Han, Xudong, Baldwin, Timothy, Abend, Omri, Hovy, Eduard, Frermann, Lea

Feb-20-2025–arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. While controlled prompt engineering and model fine-tuning show modest improvements, our results indicate that instruction hierarchy enforcement is not robustly realized, calling for deeper architectural innovations beyond surface-level modifications.

constraint, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

Feb-20-2025

arXiv.org PDF

Add feedback

Country:
- Asia > Japan
  - Honshū > Kansai (0.14)
- North America (0.93)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.69)
  - Natural Language > Large Language Model (1.00)