OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
–Neural Information Processing Systems
Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction.
Mar-21-2026, 09:44:49 GMT
- Technology: