A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Neural Information Processing Systems 

Recent developments in vision large language models (LLMs) have brought remarkable progress, yet these models still face challenges on the path toward multimodal generalists, such as coarse instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage of diverse vision tasks.