MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Oct-9-2025, 18:43:22 GMT–Neural Information Processing Systems

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs).

Country:
- Asia > China > Shanghai > Shanghai (0.04)

Genre:
- Research Report > Experimental Study (0.93)

Industry:
- Law (0.92)
- Government (0.92)
- Information Technology (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language
    - Large Language Model (0.98)
    - Chatbot (0.72)
  - Machine Learning > Neural Networks
    - Deep Learning (0.50)
