A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
–Neural Information Processing Systems
Vision large language models (LLMs) have made remarkable progress in recent years, yet they still fall short of being multimodal generalists, owing to limitations such as coarse-grained instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage across vision tasks.
Mar-21-2025, 20:46:52 GMT