General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Zhou, Lexin, Pacchiardi, Lorenzo, Martínez-Plumed, Fernando, Collins, Katherine M., Moros-Daval, Yael, Zhang, Seraphina, Zhao, Qinlin, Huang, Yitian, Sun, Luning, Prunty, Jonathan E., Li, Zongqian, Sánchez-García, Pablo, Chen, Kexin Jiang, Casares, Pablo A. M., Zu, Jiyun, Burden, John, Mehrbakhsh, Behzad, Stillwell, David, Cebrian, Manuel, Wang, Jindong, Henderson, Peter, Wu, Sherry Tongshuang, Kyllonen, Patrick C., Cheke, Lucy, Xie, Xing, Hernández-Orallo, José

Mar-15-2025–arXiv.org Artificial Intelligence

Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate. In an illustration with 15 large language models and 63 tasks, inspecting the demand and ability profiles yields high explanatory power, offering insights into the sensitivity and specificity exhibited by different benchmarks, and into how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, these demand levels also enable high predictive power at the instance level, giving better estimates than black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)

data mining, large language model, machine learning

Country:
- Africa > Eswatini
  - Manzini > Manzini (0.04)
- Asia
  - Indonesia > Bali (0.04)
  - Japan (0.04)
  - Middle East
    - Iran > Tehran Province
      - Tehran (0.04)
    - Jordan (0.04)
- Europe
  - Austria > Vienna (0.13)
  - Belgium > Flanders
    - Flemish Brabant > Leuven (0.04)
  - France (0.04)
  - Holy See (0.04)
  - Spain
    - Community of Madrid > Madrid (0.04)
    - Valencian Community > Valencia Province
      - Valencia (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.28)
    - Oxfordshire > Oxford (0.04)
- North America
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
  - United States
    - California (0.04)
    - Massachusetts > Middlesex County
      - Reading (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.04)
    - New Jersey > Bergen County
      - Mahwah (0.04)
    - New York (0.04)
    - Pennsylvania > Philadelphia County
      - Philadelphia (0.04)
    - Texas (0.04)
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre:
- Instructional Material (1.00)
- Overview (0.92)
- Questionnaire & Opinion Survey (0.92)
- Research Report
  - Experimental Study (0.67)
  - New Finding (0.92)

Industry:
- Media (0.67)
- Transportation (0.65)
- Banking & Finance (1.00)
- Education
  - Assessment & Standards (0.67)
  - Educational Setting
    - Higher Education (0.67)
    - K-12 Education (1.00)
- Health & Medicine
  - Consumer Health (0.92)
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area > Psychiatry/Psychology (0.67)
- Law (1.00)
- Energy (0.92)
- Leisure & Entertainment
  - Games (0.92)
  - Sports (1.00)
- Consumer Products & Services (1.00)
- Government > Regional Government (0.67)

Technology:
- Information Technology
  - Artificial Intelligence
    - Cognitive Science > Problem Solving (0.92)
    - Issues (0.92)
    - Machine Learning
      - Neural Networks > Deep Learning (1.00)
      - Performance Analysis (0.87)
      - Statistical Learning (1.00)
    - Natural Language
      - Chatbot (1.00)
      - Large Language Model (1.00)
    - Representation & Reasoning (1.00)
  - Data Science > Data Mining (1.00)
