Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
Fuyu Xing, Zimu Wang, Wei Wang, Haiyang Zhang
The proliferation of multimedia content necessitates effective Multimedia Event Extraction (M2E2) systems. Although Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility on the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluation covers the text-only, image-only, and cross-media subtasks under both few-shot prompting and fine-tuning settings. We report three key findings: (1) under few-shot prompting, LVLMs perform notably better on visual tasks than on textual ones; (2) fine-tuning LVLMs with LoRA substantially improves performance; and (3) LVLMs exhibit strong synergy across modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis revealing persistent challenges in semantic precision, localization, and cross-modal grounding, which remain critical obstacles to advancing M2E2 capabilities.
arXiv.org Artificial Intelligence
Sep-17-2025
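
The abstract reports that LoRA fine-tuning substantially improves LVLM performance on M2E2. As a minimal sketch of what such a setup could look like (the paper's exact configuration is not given here), the example below uses Hugging Face transformers and peft; the checkpoint name, target modules, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of LoRA fine-tuning an LVLM; NOT the authors' exact setup.
# Checkpoint, target modules, and hyperparameters are assumptions for illustration.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2-VL-2B-Instruct"  # assumed Qwen-VL-series checkpoint
model = AutoModelForVision2Seq.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

# LoRA trains low-rank updates on selected projection layers
# instead of updating the full model weights.
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # common attention targets (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

The adapted model can then be trained with a standard trainer loop on event-extraction examples formatted as image-text prompts; only the LoRA adapter weights are updated, which is what makes this setting practical for large LVLMs.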