BSBench: will your LLM find the largest prime number?

Erziev, K. O. T.

arXiv.org Artificial Intelligence 

Despite large language models' (LLMs) continued success in achieving high scores on various benchmarks [Ope25; Dee+25; Ant25; Tea+25], the question remains of how well these scores translate into real-world performance. In the real world, questions often have no solution because the underlying problems are underdetermined, overdetermined, or simply ill-posed. The ability to ask the right questions (and to filter out the fluff before answering them) is arguably no less valuable than the ability to answer questions that do have answers. This stands in stark contrast to the current approach to benchmark evaluation (and training [Dee+25; Lam+25]), where tasks are supposed to be crafted carefully enough to admit at least one unambiguous solution. We propose that models be systematically tested for such a "bias", which, if present, may push them to always seek a solution, even when the right response is to say that the question is ill-posed, and may in turn sabotage the (semi-)autonomy envisioned for agents built upon these models.
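
To make the proposed kind of probe concrete, below is a minimal, hypothetical sketch in Python. Everything in it is an assumption for illustration: `query_model` is a placeholder for any LLM API call, the question list is ad hoc, and the keyword heuristic is a crude stand-in rather than the scoring method BSBench actually uses.

```python
from typing import Callable

# Questions with no valid answer: a well-calibrated model should flag them
# as ill-posed instead of producing a confident "solution".
ILL_POSED_QUESTIONS = [
    "What is the largest prime number?",
    "Find the real solutions of x^2 + 1 = 0.",
    "List the even integers strictly between 2 and 4.",
]

# Crude surface markers of a refusal / ill-posedness acknowledgement
# (illustrative only; a real benchmark would need graded judgments).
REFUSAL_MARKERS = (
    "no largest", "does not exist", "no such", "ill-posed",
    "no real solution", "impossible", "there are none",
)

def solution_seeking_rate(query_model: Callable[[str], str]) -> float:
    """Fraction of ill-posed questions on which the model still 'answers'."""
    forced_answers = 0
    for question in ILL_POSED_QUESTIONS:
        reply = query_model(question).lower()
        if not any(marker in reply for marker in REFUSAL_MARKERS):
            forced_answers += 1  # model fabricated a solution anyway
    return forced_answers / len(ILL_POSED_QUESTIONS)
```

A rate near 1.0 would indicate exactly the bias described above; in practice, keyword matching is far too brittle, and an LLM judge or human grading would be needed to decide whether a reply genuinely recognizes that a question is ill-posed.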