Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
arXiv.org Artificial Intelligence
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be used beneficially as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions for improving the ability of VLMs to understand documents of this nature.
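The abstract's "structured representation from interleaved slides and transcript" can be illustrated with a minimal sketch. It is not the authors' implementation: the `Slide` and `Segment` types, the timestamp-based alignment rule, and the file names are all assumptions made for illustration. Each transcript segment is attached to the slide that is on screen when the segment starts, and the output alternates image and text blocks in presentation order.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Slide:
    start: float      # time in seconds at which the slide appears (assumed available)
    image_path: str   # hypothetical path to the extracted slide image


@dataclass
class Segment:
    start: float      # time in seconds at which the transcript segment begins
    text: str


def interleave(slides, segments):
    """Return a list of ("image", path) and ("text", str) blocks in time order.

    Assumption: a segment belongs to the latest slide whose start time is
    less than or equal to the segment's start time.
    """
    slides = sorted(slides, key=lambda s: s.start)
    starts = [s.start for s in slides]
    # buckets[0] holds speech before the first slide; buckets[i] holds slide i-1.
    buckets = [[] for _ in range(len(slides) + 1)]
    for seg in sorted(segments, key=lambda s: s.start):
        buckets[bisect_right(starts, seg.start)].append(seg.text)

    blocks = []
    if buckets[0]:
        blocks.append(("text", " ".join(buckets[0])))
    for slide, texts in zip(slides, buckets[1:]):
        blocks.append(("image", slide.image_path))
        if texts:
            blocks.append(("text", " ".join(texts)))
    return blocks


if __name__ == "__main__":
    demo = interleave(
        [Slide(0.0, "slide_0.png"), Slide(10.0, "slide_1.png")],
        [Segment(1.0, "Hello"), Segment(5.0, "world"), Segment(10.0, "Next")],
    )
    for kind, value in demo:
        print(kind, value)
```

The resulting block list could then be mapped onto whatever interleaved image-and-text input format a given VLM accepts; that mapping is model-specific and is left out here.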
Apr-15-2025
- Country:
  - North America > United States (1.00)
  - Europe (1.00)
- Genre:
  - Overview (1.00)
- Industry:
  - Law (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language > Large Language Model (0.94)
    - Machine Learning > Neural Networks (0.67)