EgoNormia: Benchmarking Physical Social Norm Understanding
Rezaei, MohammadHossein; Fu, Yicheng; Cuvin, Phil; Ziems, Caleb; Zhang, Yanzhe; Zhu, Hao; Yang, Diyi
arXiv.org Artificial Intelligence
Human activity is moderated by norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $\|\epsilon\|$, a benchmark consisting of 1,853 egocentric videos of human interactions, each paired with two related questions that evaluate both the prediction and the justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human baseline of 92%). Our analysis of performance in each dimension highlights significant risks in safety and privacy, as well as a lack of collaboration and communication capability, when these models are applied to real-world agents. We additionally show that, through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.
Mar-5-2025
- Country:
- North America
- United States
- Arizona (0.04)
- California > Santa Clara County
- Palo Alto (0.04)
- Canada > Ontario
- Toronto (0.14)
- Europe
- Italy > Tuscany
- Florence (0.04)
- Germany > Bavaria
- Upper Bavaria > Munich (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Genre:
- Research Report (0.82)
- Industry:
- Information Technology > Security & Privacy (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning > Agents (1.00)
- Natural Language
- Large Language Model (1.00)
- Chatbot (1.00)
- Machine Learning > Neural Networks
- Deep Learning (1.00)