Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks
Ganguly, Debargha, Singh, Vikash, Sankar, Sreehari, Zhang, Biyao, Zhang, Xuecen, Iyengar, Srinivasan, Han, Xiaotian, Sharma, Amit, Kalyanaraman, Shivkumar, Chaudhary, Vipin
–arXiv.org Artificial Intelligence
Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
arXiv.org Artificial Intelligence
May-27-2025
- Country:
- Asia
- Middle East
- Israel > Haifa District
- Haifa (0.04)
- Jordan (0.04)
- Israel > Haifa District
- Thailand > Bangkok
- Bangkok (0.04)
- Middle East
- Europe > United Kingdom
- England > Cambridgeshire > Cambridge (0.14)
- North America > United States
- Florida > Miami-Dade County > Miami (0.04)
- Asia
- Genre:
- Research Report > New Finding (0.45)
- Technology: