MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Neural Information Processing Systems 

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs).
