SPHERE: An Evaluation Card for Human-AI Systems

Ma, Qianou; Zhao, Dora; Zhao, Xinran; Si, Chenglei; Yang, Chenyang; Louie, Ryan; Reiter, Ehud; Yang, Diyi; Wu, Tongshuang

Apr-14-2025–arXiv.org Artificial Intelligence

In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion of human-AI system evaluation design options, we present SPHERE, an evaluation card that encompasses five key dimensions: 1) What is being evaluated? 2) How is the evaluation conducted? 3) Who is participating in the evaluation? 4) When is the evaluation conducted? 5) How is the evaluation validated? We review 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement, and provide three recommendations for improving the validity and rigor of evaluation practices.
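
As a rough illustration of how the five dimensions could be recorded for a single system, the sketch below models a card as a small Python dataclass. The class name `EvaluationCard`, its field names, the `missing_dimensions` helper, and the example values are all hypothetical; the abstract only names the five questions and does not prescribe any schema.

```python
from dataclasses import dataclass, fields


@dataclass
class EvaluationCard:
    """Hypothetical container with one free-text field per SPHERE dimension."""
    what: str        # 1) What is being evaluated?
    how: str         # 2) How is the evaluation conducted?
    who: str         # 3) Who is participating in the evaluation?
    when: str        # 4) When is evaluation conducted?
    validation: str  # 5) How is evaluation validated?

    def missing_dimensions(self) -> list[str]:
        """Return the names of dimensions left blank, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Made-up example entry; the values do not describe any system from the review.
card = EvaluationCard(
    what="quality of co-written drafts",
    how="user study with Likert-scale ratings",
    who="",
    when="after a single session",
    validation="",
)
undocumented = card.missing_dimensions()
```

A check like `missing_dimensions` is one possible way to flag undocumented dimensions when filling in a card for a system.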

large language model, machine learning, natural language

Country:
- North America > United States (1.00)
- Europe (1.00)
- Asia (1.00)

Genre:
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (0.94)

Industry:
- Health & Medicine (1.00)
- Education (1.00)
- Media (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Issues > Social & Ethical Issues (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
