VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT
Hyeonsu Kang, Emily Bao, Anjan Goswami
Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored despite their growing role as critics in agentic, model-forward pipelines. We introduce VLM-SlideEval, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck's narrative order from shuffled slides. Using publicly available decks from Zenodo (https://huggingface.co/datasets/Forceless/Zenodo10K/viewer/default/pptx), we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show non-trivial agreement, fidelity, and consistency under controlled perturbations, while performing better on single-slide content understanding; however, they do not reliably capture narrative structure across slides. These results highlight the limits of current VLMs for slide evaluation and motivate calibrated, critic-in-the-loop evaluators that drive iterative refinement and selection in agentic pipelines.
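As a concrete illustration of the pipeline the abstract describes, the sketch below reads per-element metadata from a .pptx file with the python-pptx library, flattens it into verifiable records, applies a simple geometric jitter as a controlled perturbation, and scores slide-order recovery with a rank correlation. This is a minimal sketch under stated assumptions: the record fields, the jitter magnitude, and the choice of Kendall's tau are illustrative, not the paper's actual schema, perturbation suite, or metric.

```python
# Minimal sketch, assuming python-pptx and scipy are available. The record
# fields, jitter size, and Kendall's-tau metric are illustrative assumptions,
# not the paper's actual schema, perturbation suite, or scoring function.
from pptx import Presentation
from pptx.util import Emu
from scipy.stats import kendalltau

def extract_elements(pptx_path):
    """Yield one flat, verifiable metadata record per shape, slide by slide."""
    prs = Presentation(pptx_path)
    for slide_idx, slide in enumerate(prs.slides):
        for shape in slide.shapes:
            yield {
                "slide": slide_idx,
                "shape_id": shape.shape_id,
                "type": str(shape.shape_type),
                # python-pptx reports geometry in EMU (914,400 per inch);
                # placeholder geometry can be inherited, hence the None guards.
                "left_in": shape.left.inches if shape.left is not None else None,
                "top_in": shape.top.inches if shape.top is not None else None,
                "width_in": shape.width.inches if shape.width is not None else None,
                "height_in": shape.height.inches if shape.height is not None else None,
                "text": shape.text_frame.text if shape.has_text_frame else None,
            }

def jitter_geometry(pptx_path, out_path, dx=Emu(91440), dy=Emu(91440)):
    """Controlled geometric perturbation: shift every shape by (dx, dy).
    91,440 EMU = 0.1 inch; the magnitude here is an arbitrary example."""
    prs = Presentation(pptx_path)
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.left is not None:
                shape.left = Emu(shape.left + dx)
            if shape.top is not None:
                shape.top = Emu(shape.top + dy)
    prs.save(out_path)

def order_agreement(predicted_order, true_order):
    """Rank correlation between a model's recovered slide order and the
    deck's true order (tau = 1.0 means a perfect reconstruction)."""
    tau, _ = kendalltau(predicted_order, true_order)
    return tau
```

Field-level alignment against a VLM's extraction can then be scored record by record, e.g. geometric overlap on the position and size fields and exact or fuzzy match on the text field.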
arXiv.org Artificial Intelligence
Oct-28-2025