AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models

Edwards, Kristen M., Tehranchi, Farnaz, Miller, Scarlett R., Ahmed, Faez

Apr-1-2025–arXiv.org Artificial Intelligence

The subjective evaluation of early-stage engineering designs, such as conceptual sketches, traditionally relies on human experts. However, expert evaluations are time-consuming, expensive, and sometimes inconsistent. Recent advances in vision-language models (VLMs) offer the potential to automate design assessments, but it is crucial to ensure that these AI "judges" perform on par with human experts. To date, no framework exists for assessing expert equivalence. This paper introduces a rigorous statistical framework to determine whether an AI judge's ratings match those of human experts. We apply this framework in a case study evaluating four VLM-based judges on key design metrics (uniqueness, creativity, usefulness, and drawing quality). These AI judges employ various in-context learning (ICL) techniques, including unimodal vs. multimodal prompts and inference-time reasoning. The same statistical framework is used to assess three trained novices for expert equivalence. Results show that the top-performing AI judge, using text- and image-based ICL with reasoning, achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across all metrics. In 6/6 runs for both uniqueness and creativity, and 5/6 runs for both drawing quality and usefulness, its agreement with experts meets or exceeds that of the majority of trained novices. These findings suggest that reasoning-supported VLMs can achieve human-expert equivalence in design evaluation. This has implications for scaling design evaluation in education and practice, and provides a general statistical framework for validating AI judges in other domains requiring subjective content evaluation.
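The comparison "agreement with experts meets or exceeds that of the majority of trained novices" can be illustrated with a minimal sketch. The abstract does not state which agreement statistic or equivalence tests the framework uses, so Pearson correlation is used here purely as a stand-in assumption; the function names and all ratings below are hypothetical and are not taken from the paper.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length rating lists.

    Assumption: used only as a stand-in agreement measure; the paper's
    actual statistic is not given in the abstract.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def meets_majority_of_novices(ai, novices, expert):
    """True if the AI judge's agreement with the expert is at least as
    high as that of more than half of the novices."""
    ai_agreement = pearson(ai, expert)
    novice_agreements = [pearson(n, expert) for n in novices]
    matched = sum(ai_agreement >= a for a in novice_agreements)
    return matched > len(novices) / 2


def count_passing_runs(ai_runs, novices, expert):
    """Number of AI runs that meet or exceed the majority of novices."""
    return sum(meets_majority_of_novices(run, novices, expert) for run in ai_runs)


# Hypothetical ratings for six designs on one metric (illustration only).
expert = [1, 2, 3, 4, 5, 3]
novices = [
    [2, 2, 3, 3, 4, 3],
    [5, 4, 3, 2, 1, 3],
    [1, 3, 2, 4, 5, 2],
]
ai_runs = [
    [1, 2, 3, 5, 5, 3],  # tracks the expert closely
    [5, 4, 3, 2, 1, 3],  # reversed ordering, poor agreement
]

passing = count_passing_runs(ai_runs, novices, expert)
print(f"{passing}/{len(ai_runs)} runs meet or exceed the majority of novices")
```

A tally such as "5/6 runs" corresponds to the value returned by `count_passing_runs` over six repeated runs of the same judge. A full equivalence analysis would also need formal hypothesis tests, which this sketch does not attempt.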

large language model, machine learning, natural language, (21 more...)

Country:
- North America > United States
  - Pennsylvania > Centre County
    - University Park (0.04)
  - Massachusetts
    - Middlesex County > Cambridge (0.14)
    - Suffolk County > Boston (0.04)
  - Florida > Miami-Dade County
    - Miami (0.04)
  - Colorado > Boulder County
    - Boulder (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East
  - Jordan (0.04)
  - Iran > Tehran Province
    - Tehran (0.04)
- Africa > Guinea
  - Kankan Region > Kankan Prefecture > Kankan (0.04)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (0.92)

Industry:
- Education (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Applied AI (1.00)
  - Machine Learning > Statistical Learning (0.46)
  - Natural Language
    - Large Language Model (0.69)
    - Chatbot (0.46)
