The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models

Herambourg, Claudia, Siuda, Dawid, Kopczyńska, Julia, Santos, Joao R. L., Sas, Wojciech, Śmietańska-Nowak, Joanna

Nov-6-2025–arXiv.org Artificial Intelligence

We present ORCA (Omni Research on Calculation in AI) Benchmark - a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni's calculator engine. In 500 natural-language tasks across domains such as finance, physics, health, and statistics, the five state-of-the-art systems (ChatGPT-5, Gemini~2.5~Flash, Claude~Sonnet~4.5, Grok~4, and DeepSeek~V3.2) achieved only $45\text{--}63\,\%$ accuracy, with errors mainly related to rounding ($35\,\%$) and calculation mistakes ($33\,\%$). Results in specific domains indicate strengths in mathematics and engineering, but weaknesses in physics and natural sciences. Correlation analysis ($r \approx 0.40\text{--}0.65$) shows that the models often fail together but differ in the types of errors they make, highlighting their partial complementarity rather than redundancy. Unlike standard math datasets, ORCA evaluates step-by-step reasoning, numerical precision, and domain generalization across real problems from finance, physics, health, and statistics.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

Nov-6-2025

arXiv.org PDF

Add feedback

Country:
- Europe (1.00)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Education > Educational Setting (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)