On the Measure of a Model: From Intelligence to Generality
Ruchira Dhar, Ninell Oldenburg, Anders Soegaard
Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive: it lacks a stable definition and fails to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than in abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; rather, generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.
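As a rough illustration of how a generality-oriented evaluation might be operationalized, consider scoring a model by both its average performance across tasks (breadth) and how uniformly it performs across them (reliability). This is a minimal sketch, not the paper's formalization: the aggregation rule, the function name `generality_score`, and the task suite below are all assumptions introduced for illustration.

```python
import statistics

def generality_score(task_scores: dict[str, float]) -> dict[str, float]:
    """Summarize multitask performance as breadth (mean score) and
    reliability (1 minus the spread of scores across tasks).

    Assumes each score lies in [0, 1]; both outputs then do too.
    """
    scores = list(task_scores.values())
    breadth = statistics.mean(scores)
    # Population standard deviation: 0 when performance is perfectly uniform.
    spread = statistics.pstdev(scores)
    reliability = 1.0 - spread
    return {"breadth": breadth, "reliability": reliability}

# Hypothetical per-task accuracies on a practical task suite.
suite = {"question_answering": 0.82, "summarization": 0.74, "coding": 0.61}
print(generality_score(suite))
# {'breadth': 0.7233..., 'reliability': 0.9133...}
```

Under this kind of scheme, a model that is mediocre everywhere can outscore one that excels on a single abstract-reasoning benchmark but collapses elsewhere, which is the intuition behind grounding evaluation in generality rather than intelligence.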
arXiv.org Artificial Intelligence
Nov-18-2025
- Country:
- Asia
- Europe
- Denmark > Capital Region
- Copenhagen (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America > United States
- Virginia (0.04)
- Oceania > Australia (0.04)
- Genre:
- Research Report (0.50)
- Industry:
- Education (0.88)
- Health & Medicine > Therapeutic Area
- Neurology (0.68)