Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Tuck, Bryan E., Verma, Rakesh M.
arXiv.org Artificial Intelligence
Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.
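To make the evaluation setup concrete, the following is a minimal sketch of what character-level constraint checking and set-based F1 scoring could look like. The abstract does not specify the puzzle format, so the constraint used here (a pool of allowed letters, one required letter, a minimum length) and the names `satisfies_constraints`, `f1_score`, `allowed`, `required`, and `min_len` are all assumptions for illustration. The word lists are invented and are not the paper's data.

```python
def satisfies_constraints(word, allowed, required, min_len=4):
    """Check a word against a hypothetical orthographic constraint.

    Assumed rule (not taken from the paper): the word uses only letters
    from `allowed` (repeats permitted), contains `required` at least once,
    and has at least `min_len` characters.
    """
    w = word.lower()
    return len(w) >= min_len and required in w and set(w) <= set(allowed)


def f1_score(predicted, gold):
    """Set-based F1 between predicted words and the gold answer list."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example puzzle: letters "adtolpn", required letter "o".
allowed, required = "adtolpn", "o"
gold = {"poop", "loll", "tool", "load"}          # hypothetical answer key
model_output = ["tool", "load", "told", "dot"]   # hypothetical model guesses

# Drop guesses that violate the hard constraint before scoring.
valid_guesses = [w for w in model_output
                 if satisfies_constraints(w, allowed, required)]
score = f1_score(valid_guesses, gold)
```

In this toy case "dot" is rejected for length, "told" satisfies the constraint but is not in the answer key, and the repeated-letter words "poop" and "loll" are valid answers that the hypothetical model never proposes, which mirrors the kind of miss the abstract describes.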
Nov-27-2025
- Country:
  - Asia > Middle East
    - Jordan (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.14)
  - Europe
    - Austria > Vienna (0.14)
    - Denmark > Southern Denmark (0.04)
  - North America
    - Canada > British Columbia
    - United States
      - Florida > Miami-Dade County
        - Miami (0.14)
      - Texas > Harris County
        - Houston (0.04)
      - Washington > King County
        - Seattle (0.04)
- Genre:
  - Research Report (0.64)