EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
Chen, Tianyu, Zhang, Yasi, Zhang, Zhi, Yu, Peiyu, Wang, Shu, Wang, Zhendong, Lin, Kevin, Wang, Xiaofei, Yang, Zhengyuan, Li, Linjie, Lin, Chung-Ching, Xie, Jianwen, Leong, Oscar, Wang, Lijuan, Wu, Ying Nian, Zhou, Mingyuan
arXiv.org Artificial Intelligence
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images, which limits coverage and inherits biases from prior generative models, or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to precisely assess both standard single-turn and multi-turn instruction-based editing. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating the semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
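The idea of an object pool that evolves across turns, and of a consistency score computed only over objects no instruction has touched, can be illustrated with a toy sketch. This is not the authors' implementation: the instruction schema (`type`, `object`, `new_object`), the function names, and the use of plain cosine similarity over hand-written feature vectors are all assumptions made for illustration. A real pipeline would presumably obtain object features from a vision encoder applied to detector crops.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def update_pool(pool, instruction):
    """Return a new object pool after one edit (hypothetical schema)."""
    kind, target = instruction["type"], instruction["object"]
    new_pool = set(pool)
    if kind == "add":
        new_pool.add(target)
    elif kind == "remove":
        new_pool.discard(target)
    elif kind == "replace":
        new_pool.discard(target)
        new_pool.add(instruction["new_object"])
    # Any other instruction type leaves the set of object names unchanged.
    return new_pool


def consistency(before, after, edited):
    """Mean cosine similarity over objects present in both images and never edited.

    `before` and `after` map object names to feature vectors (assumed inputs);
    `edited` is the set of object names targeted by any instruction so far.
    Returns None when no unchanged object is available to compare.
    """
    unchanged = [n for n in before if n in after and n not in edited]
    if not unchanged:
        return None
    return sum(cosine(before[n], after[n]) for n in unchanged) / len(unchanged)


# Two editing turns applied to a three-object scene.
pool = {"dog", "ball", "tree"}
pool = update_pool(pool, {"type": "remove", "object": "ball"})
pool = update_pool(pool, {"type": "replace", "object": "dog", "new_object": "cat"})

# Only "tree" was never edited, so only its features enter the score.
before = {"tree": [1.0, 0.0], "dog": [0.0, 1.0]}
after = {"tree": [1.0, 0.0], "dog": [1.0, 0.0]}
score = consistency(before, after, edited={"dog", "ball"})
```

Restricting the comparison to untouched objects means an edit that changes its target as instructed is not penalized, while drift in the rest of the scene lowers the score.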
Oct-17-2025
- Country:
  - Europe > Switzerland (0.04)
  - North America > United States
    - California > Los Angeles County
      - Los Angeles (0.14)
    - Texas > Travis County
      - Austin (0.04)
- Genre:
  - Research Report (0.64)
- Industry:
  - Leisure & Entertainment > Sports (0.45)
- Technology: