DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Neural Information Processing Systems

Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations.