OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Mar-21-2026, 09:44:49 GMT – Neural Information Processing Systems

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction.

Technology:
- Information Technology > Artificial Intelligence (1.00)