SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu
arXiv.org Artificial Intelligence
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and to facilitate discussion of design options for human-AI system evaluation, we present SPHERE, an evaluation card spanning five key dimensions: 1) What is being evaluated? 2) How is the evaluation conducted? 3) Who is participating in the evaluation? 4) When is the evaluation conducted? 5) How is the evaluation validated? We review 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement, and we offer three recommendations for improving the validity and rigor of evaluation practice.
Apr-14-2025