

Supplementary Materials for "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks"

Here, we show some examples of instructions for task-level customization, including object detection, instance segmentation, visual grounding, image captioning, and visual question answering (VQA). Following various instructions, our model can elegantly switch among different vision-centric tasks and accomplish them in a unified manner like LLMs.

A.1 Object Detection

Example 1. "Please examine the image and identify all objects in the category set <class>. For each object, specify its location within the range <range> by determining the top-left and bottom-right corners of its bounding box. To indicate the object's class and location, provide the output in the format (c, x1, y1, x2, y2), where 'c' represents the class index starting from 0, and (x1, y1, x2, y2) correspond to the offsets of the bounding box corners relative to the center point. The image is: <image>" ...
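To make the (c, x1, y1, x2, y2) output convention concrete, the following is a minimal sketch of how such model outputs could be parsed and converted back to absolute pixel coordinates. It is not the authors' implementation: the helper name `parse_detections`, the assumption that offsets are rendered as plain integers, and the exact text formatting of the decoded tuples are all illustrative assumptions.

```python
import re

# Hypothetical helper (not from the paper): parse detection outputs of the
# form "(c, x1, y1, x2, y2)" and convert the corner offsets, which the
# instruction defines relative to the image center, into absolute pixels.
def parse_detections(text: str, img_w: int, img_h: int):
    """Return a list of (class_index, x1, y1, x2, y2) in absolute pixel coordinates."""
    cx, cy = img_w / 2, img_h / 2  # image center: the reference point for the offsets
    pattern = r"\((\d+),\s*(-?\d+),\s*(-?\d+),\s*(-?\d+),\s*(-?\d+)\)"
    boxes = []
    for match in re.finditer(pattern, text):
        c, dx1, dy1, dx2, dy2 = map(int, match.groups())
        # Shift each corner offset by the center to recover the absolute box.
        boxes.append((c, cx + dx1, cy + dy1, cx + dx2, cy + dy2))
    return boxes

if __name__ == "__main__":
    # Example decoded output with two detections (values are made up).
    output = "(0, -120, -80, 40, 60) (2, 10, 20, 200, 180)"
    for det in parse_detections(output, img_w=512, img_h=512):
        print(det)
```

Under this reading, a class index of 0 refers to the first entry of the category set supplied in the instruction, and negative offsets simply denote corners to the left of or above the image center.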