PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

Wahed, Muntasir; Nguyen, Kiet A.; Juvekar, Adheesh Sunil; Li, Xinzhuo; Zhou, Xiaona; Shah, Vedant; Yu, Tianjiao; Yanardag, Pinar; Lourentzou, Ismini

arXiv.org Artificial Intelligence 

Despite significant advances in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, along with PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate that PRIMA outperforms state-of-the-art baselines.