BSBench: will your LLM find the largest prime number?

Erziev, K. O. T.

arXiv.org Artificial Intelligence 

Despite large language models' (LLMs) continued success in achieving high scores on various benchmarks [Ope25; Dee+25; Ant25; Tea+25], the question remains of how well these scores translate into real-world performance. In the real world, questions often have no solution because the underlying problems are underdetermined, overdetermined, or simply ill-posed. The ability to ask the right questions (and to filter out the fluff before answering them) is arguably no less valuable than the ability to answer questions that do have answers. This stands in stark contrast to the current approach to benchmark evaluation (and training [Dee+25; Lam+25]), where tasks are supposed to be crafted carefully enough to admit at least one unambiguous solution. We propose that models be systematically tested for such a "bias", which, if present, may push them to always seek a solution, even when the right response is to say that the question is ill-posed, and may in turn sabotage the (semi-)autonomy envisioned for agents built upon these models.
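
To make the proposed kind of probe concrete, below is a minimal, hypothetical sketch in Python. Everything in it is an assumption for illustration: `query_model` is a placeholder for any LLM API call, the question list is ad hoc, and the keyword heuristic is a crude stand-in rather than the scoring method BSBench actually uses.

```python
from typing import Callable

# Questions with no valid answer: a well-calibrated model should flag them
# as ill-posed instead of producing a confident "solution".
ILL_POSED_QUESTIONS = [
    "What is the largest prime number?",
    "Find the real solutions of x^2 + 1 = 0.",
    "List the even integers strictly between 2 and 4.",
]

# Crude surface markers of a refusal / ill-posedness acknowledgement
# (illustrative only; a real benchmark would need graded judgments).
REFUSAL_MARKERS = (
    "no largest", "does not exist", "no such", "ill-posed",
    "no real solution", "impossible", "there are none",
)

def solution_seeking_rate(query_model: Callable[[str], str]) -> float:
    """Fraction of ill-posed questions on which the model still 'answers'."""
    forced_answers = 0
    for question in ILL_POSED_QUESTIONS:
        reply = query_model(question).lower()
        if not any(marker in reply for marker in REFUSAL_MARKERS):
            forced_answers += 1  # model fabricated a solution anyway
    return forced_answers / len(ILL_POSED_QUESTIONS)
```

A rate near 1.0 would indicate exactly the bias described above; in practice, keyword matching is far too brittle, and an LLM judge or human grading would be needed to decide whether a reply genuinely recognizes that a question is ill-posed.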