An Approach to Grounding AI Model Evaluations in Human-derived Criteria
In the rapidly evolving field of artificial intelligence (AI), traditional benchmarks often fall short of capturing the nuanced capabilities of AI models. Focusing on the case of physical world modeling, we propose a novel approach that augments existing benchmarks with human-derived evaluation criteria, aiming to enhance the interpretability and applicability of model behaviors. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys to identify key cognitive skills, such as Prioritization, Memorizing, Discerning, and Contextualizing, that are critical for both AI and human reasoning. Our findings reveal that participants perceive AI as lacking interpretive and empathetic skills, yet they hold high expectations for AI performance. By integrating these insights into benchmark design, we offer a framework for developing more human-aligned means of defining and measuring progress. This work underscores the importance of user-centered evaluation in AI development, providing actionable guidelines for researchers and practitioners aiming to align AI capabilities with human cognitive processes. Our approach both enhances current benchmarking practices and sets the stage for future advances in AI model evaluation.
arXiv.org Artificial Intelligence
Sep-8-2025
- Country:
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- Genre:
- Questionnaire & Opinion Survey (0.88)
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science (1.00)
- Machine Learning (1.00)
- Natural Language (0.94)
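The abstract describes a framework for augmenting benchmarks with human-derived skill criteria but, as a summary, gives no implementation details. The sketch below is a hypothetical illustration rather than the paper's method: it assumes benchmark items (e.g., Perception Test or OpenEQA questions) have been annotated with the four skills named in the abstract, and shows how per-skill accuracy could be reported alongside an overall score. All function and field names are invented for the example.

```python
from collections import defaultdict

# Human-derived skill tags taken from the abstract; the tagging scheme itself is assumed.
SKILLS = {"Prioritization", "Memorizing", "Discerning", "Contextualizing"}

def per_skill_accuracy(items, predictions):
    """Aggregate benchmark accuracy per human-derived skill.

    items: list of dicts like {"id": ..., "skills": [...], "answer": ...},
           e.g. Perception Test or OpenEQA questions annotated with skills.
    predictions: dict mapping item id -> the model's answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        is_correct = predictions.get(item["id"]) == item["answer"]
        for skill in item["skills"]:
            if skill not in SKILLS:
                continue  # ignore tags outside the assumed taxonomy
            total[skill] += 1
            correct[skill] += int(is_correct)
    return {skill: correct[skill] / total[skill] for skill in total}

# Usage sketch with made-up items and answers.
items = [
    {"id": "q1", "skills": ["Memorizing"], "answer": "on the table"},
    {"id": "q2", "skills": ["Discerning", "Contextualizing"], "answer": "two"},
]
predictions = {"q1": "on the table", "q2": "three"}
print(per_skill_accuracy(items, predictions))
# {'Memorizing': 1.0, 'Discerning': 0.0, 'Contextualizing': 0.0}
```

Reporting results broken down this way would let a benchmark surface which of the human-derived skills a model lacks, rather than a single aggregate number.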