MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
–Neural Information Processing Systems
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs).
Oct-9-2025, 18:43:22 GMT
- Genre:
- Research Report > Experimental Study (0.93)
- Industry:
- Government (0.92)
- Information Technology (0.67)
- Law (0.92)
- Technology: